http://www.ddj.com/hpc-high-performance-computing/208801974
in which the author claims to implement a queue using C++ std::list<>
that a single reader and a single writer can access simultaneously
without locking.
I understand the requirement that reading/writing std::list<T>::iterator
be atomic, but doesn't the implementation also require memory barriers
to prevent problems that might be caused by out-of-order writes on
multiprocessor machines?
--
Fran
Yes, the code is broken, and I've already corresponded at length with
the author and his main pre-pub reviewer (Petru and Andrei,
respectively) about the details.
Since the article has already sailed, Petru is going to post a patch
with as small a diff as possible to throw in some fences. But the code
really needs a major rewrite to:
a) Address atomicity: Don't use list::iterators, which are almost
never atomically readable and assignable in practice even on x86 --
hence the disclaimer 'check to see if they are' isn't enough.
b) Address ordering: Use (preferably) C++0x's std::atomic<T>s or
(for now) nonportable features like fences or VC++ 'volatile' which
was imbued with additional ordering semantics similar to what
std::atomic<T> will have.
This week I have to write my September DDJ article, and I'm probably
going to devote it to dissecting this code and exploring a number of
ways in which it's broken. The author graciously agreed to let me do
that -- Petru (like his main reviewer Andrei) is a good guy who has
written a number of solid articles before this one, so IMO he
shouldn't be judged too harshly on this misstep. Nevertheless, yes,
the code is broken and mustn't be allowed to stand as an example.
Herb
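For what such a rewrite might look like, here is a minimal sketch (my own illustration, not Petru's or Herb's actual code; class and member names are assumptions) of a single-producer/single-consumer queue that owns its own nodes and uses C++0x-style std::atomic<> with release/acquire ordering on the shared pointers:

```cpp
#include <atomic>
#include <cassert>

// Hedged sketch of an SPSC queue. The producer appends at the tail and
// lazily reclaims nodes the consumer has moved past; the consumer
// advances a 'divider' pointer. Only 'divider_' and 'last_' are shared.
template <typename T>
class SpscQueue {
    struct Node {
        T value;
        Node* next;
        explicit Node(const T& v) : value(v), next(nullptr) {}
        Node() : value(), next(nullptr) {}
    };
    Node* first_;                 // producer-only: head of the whole chain
    std::atomic<Node*> divider_;  // shared: consumed/unconsumed boundary
    std::atomic<Node*> last_;     // shared: last produced node
public:
    SpscQueue() {
        Node* dummy = new Node();  // queue starts with a dummy node
        first_ = dummy;
        divider_.store(dummy, std::memory_order_relaxed);
        last_.store(dummy, std::memory_order_relaxed);
    }
    ~SpscQueue() {
        while (first_) { Node* n = first_->next; delete first_; first_ = n; }
    }
    // Called only by the single producer thread.
    void produce(const T& v) {
        Node* n = new Node(v);
        last_.load(std::memory_order_relaxed)->next = n;  // link (private)
        last_.store(n, std::memory_order_release);        // publish node
        // Reclaim nodes the consumer has already moved past.
        while (first_ != divider_.load(std::memory_order_acquire)) {
            Node* tmp = first_;
            first_ = first_->next;
            delete tmp;
        }
    }
    // Called only by the single consumer thread.
    bool consume(T& out) {
        Node* div = divider_.load(std::memory_order_relaxed);
        if (div != last_.load(std::memory_order_acquire)) {  // nonempty
            out = div->next->value;
            divider_.store(div->next, std::memory_order_release);
            return true;
        }
        return false;
    }
};
```

The release store to last_ pairs with the consumer's acquire load, so a node's contents become visible before the node does; the same pairing in the other direction protects the producer's reclamation loop.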
IMO upcoming std::atomic<>'s ordering semantics suck badly. See
http://google.com/groups?as_umsgid=418AAF67...@web.de
(class mutex_and_condvar_free_single_producer_single_consumer)
for much better constraints.
Hth.
regards,
alexander.
IMO it would be better to implement the list itself (without using
std::list) within the LockFreeQueue class, so that it's standalone and
doesn't depend on the STL implementation. (It also allows some
optimizations in the list handling, e.g. recycling list item holders.)
So if we implement the list ourselves, and simply use 'volatile' for
the shared pointers, should this work (at least on x86) without
further complications?
> > b) Address ordering: Use (preferably) C++0x's std::atomic<T>s or
> > (for now) nonportable features like fences or VC++ 'volatile' which
> > was imbued with additional ordering semantics similar to what
> > std::atomic<T> will have.
>
> IMO upcoming std::atomic<>'s ordering semantics suck badly. See
>
> http://google.com/groups?as_umsgid=418AAF67.99C5F...@web.de
> (class mutex_and_condvar_free_single_producer_single_consumer)
>
> for much better constraints.
I think the implementation is buggy. It should be:
template<typename T,
         bool copy_ctor_or_dtor_can_mutate_object,
         bool copy_ctor_can_load_dst_object>
class mutex_and_condvar_free_single_producer_single_consumer {
  ...
  void producer(const T & value) {
    ELEM * tail = m_tail.load(msync::none);
    ELEM * next = advance(tail);
    while (next == m_head.load(
        type_list< msync::cchsb_t, msync::ccacq_t >::
            element<copy_ctor_can_load_dst_object>::type()))
      usleep(1000);
    new (tail) T(value);
    m_tail.store(next,
        type_list< msync::ssb_t, msync::rel_t >::
            element<copy_ctor_can_load_dst_object>::type());
  }
And even given an unbuggy implementation, the resulting code is very
fragile: every innocuous modification to the copy ctor or dtor can
break the synchronization in the queue.
This is why the upcoming std::atomic doesn't provide split hoist/sink
fences.
But I agree that it would be great if one were able to issue, for
example, 'load+hlb' or 'load+slb'.
Btw, in your proposal:
http://groups.google.com/group/comp.programming.threads/msg/b43cd6c9411c95b9
are you going to prohibit some combinations, for example 'load+rel'
or 'store+acq' (as the current C++0x draft does)?
Dmitriy V'jukov
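For reference, the current C++0x draft does draw that line: release is only meaningful on stores and acquire only on loads, and passing the wrong one violates the operation's precondition. A small illustration of my own (the commented-out lines are the draft-forbidden combinations):

```cpp
#include <atomic>

// Allowed and forbidden memory_order combinations per the C++0x draft.
int demo() {
    std::atomic<int> a(0);
    a.store(1, std::memory_order_release);      // store+rel: allowed
    int v = a.load(std::memory_order_acquire);  // load+acq: allowed
    // a.load(std::memory_order_release);       // load+rel: forbidden
    // a.store(2, std::memory_order_acquire);   // store+acq: forbidden
    return v;
}
```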
Given that the *tail "object" is actually undefined garbage at the point
of "new(tail) T(value)", and loads from it by the copy ctor (prior to
init/stores) may well trigger undefined behaviour anyway (even without
reordering), what is the point of preventing such reordering?
> This is why upcoming std::atomic doesn't provide crushed hoist/sink
> fences.
>
> But I agree, that it would be great, if one will be able to issue, for
> example, 'load+hlb' or 'load+slb'.
>
> Btw, in your proposal:
> http://groups.google.com/group/comp.programming.threads/msg/b43cd6c9411c95b9
> are you going to prohibit some combinations, for example, 'load+rel'
> or 'store+acq' (like it's in current C++0x draft)?
No.
regards,
alexander.
No, the load is not from garbage; it is from the object currently being
constructed:
class Obj
{
    int flag;

    ~Obj()
    {
        flag = 0; // (B)
    }

    Obj(Obj const& o)
    {
        flag = rand();
        if (flag) // (A)
            ...
    }
};
In your original code, a data race takes place on the 'flag' variable
between (A) and (B), because (A) does not happen-before (B): producer
contains only an ssb barrier, not a release. Hence UB.
Dmitriy V'jukov
Ah, that. Idle loads cannot trigger a data race. The model says that
they don't happen.
Keep in mind that "cc_hsb" is implied for ordinary conditional stores
(side effects) and hence non-idle flag load in (A) will be ordered with
respect to "..." store (side effect) and subsequent ssb barrier.
if (flag) some_store();
<ssb barrier>
is just fine and dandy.
regards,
alexander.
What do you mean by 'idle load'?
> Keep in mind that "cc_hsb" is implied for ordinary conditional stores
> (side effects) and hence non-idle flag load in (A) will be ordered with
> respect to "..." store (side effect) and subsequent ssb barrier.
>
> if (flag) some_store();
> <ssb barrier>
>
> is just fine and dandy.
I mean the ordering of the load of the 'flag' variable in the copy ctor
wrt the store to the 'flag' variable in the dtor. They are not ordered
by happens-before. So logically you get UB; physically, the load can
return the value stored in the dtor, not the value returned by rand().
Dmitriy V'jukov
Nope. You still need cpu memory barriers to ensure that the data being
accessed via the pointers is self-consistent. (Otherwise speculative
loads or cached stores could cause data to be seen out-of-order on other
cpus.)
Chris
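Chris's point can be made concrete. Publishing data through a shared pointer needs a release before the pointer store and an acquire after the pointer load; plain (non-VC++) volatile guarantees neither. A sketch of my own with explicit fences (the names are assumptions, not code from the article):

```cpp
#include <atomic>
#include <cassert>

struct Node { int payload; };
std::atomic<Node*> g_shared{nullptr};

// Producer side: initialize the payload, then publish the pointer.
// The release fence keeps the payload store (1) from sinking below the
// pointer store (2).
void producer(Node* n) {
    n->payload = 42;                               // (1) init data
    std::atomic_thread_fence(std::memory_order_release);
    g_shared.store(n, std::memory_order_relaxed);  // (2) publish
}

// Consumer side: the acquire fence keeps the payload load from being
// hoisted above the pointer load, so a non-null pointer implies the
// payload is fully visible.
int consumer() {
    Node* n;
    while ((n = g_shared.load(std::memory_order_relaxed)) == nullptr) { }
    std::atomic_thread_fence(std::memory_order_acquire);
    return n->payload;
}
```

Without the fences (or release/acquire on the pointer itself), speculative loads or buffered stores can let the consumer observe the pointer before the payload.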
Dmitriy V'jukov wrote:
>
> On Jul 7, 8:34 pm, Alexander Terekhov <terek...@web.de> wrote:
> > Dmitriy V'jukov wrote:
> >
> > [...]
> >
> >
> >
> > > class Obj
> > > {
> > >     int flag;
> > >
> > >     ~Obj()
> > >     {
> > >         flag = 0; // (B)
> > >     }
> > >
> > >     Obj(Obj const& o)
> > >     {
> > >         flag = rand();
> > >         if (flag) // (A)
> > >             ...
> > >     }
> > > };
> >
> > > In your original code, a data race takes place on the 'flag'
> > > variable between (A) and (B), because (A) does not happen-before
> > > (B): producer contains only an ssb barrier, not a release. Hence UB.
> >
> > Ah, that. Idle loads cannot trigger a data race. The model says that
> > they don't happen.
>
> What do you mean by 'idle load'?
For example
if (flag) ;
>
> > Keep in mind that "cc_hsb" is implied for ordinary conditional stores
> > (side effects) and hence non-idle flag load in (A) will be ordered with
> > respect to "..." store (side effect) and subsequent ssb barrier.
> >
> > if (flag) some_store();
> > <ssb barrier>
> >
> > is just fine and dandy.
>
> I mean the ordering of the load of the 'flag' variable in the copy ctor
> wrt the store to the 'flag' variable in the dtor. They are not ordered
> by happens-before. So logically you get UB; physically, the load can
> return the value stored in the dtor, not the value returned by rand().
The ssb above orders some_store() so that it happens before the dtor.
There is an implicit conditional ordering between some_store() and the
flag load, such that the flag load happens before the store, and hence
also before the dtor.
regards,
alexander.
> > I mean the ordering of the load of the 'flag' variable in the copy
> > ctor wrt the store to the 'flag' variable in the dtor. They are not
> > ordered by happens-before. So logically you get UB; physically, the
> > load can return the value stored in the dtor, not the value returned
> > by rand().
>
> The ssb above orders some_store() so that it happens before the dtor.
> There is an implicit conditional ordering between some_store() and the
> flag load, such that the flag load happens before the store, and hence
> also before the dtor.
Ok, bad example. Consider the following one instead:
atomic<int> g_something;

class Obj
{
    int flag;

    ~Obj()
    {
        flag = 0; // (B)
    }

    Obj(Obj const& o)
    {
        flag = rand();
        g_something += flag;
    }
};
Dmitriy V'jukov
Ok, I can rewrite the example this way:
volatile int flag;
...
if (flag); // <--- load of 'flag' is visible side effect here
or this way:
int flag;
...
volatile int local = flag; // <--- load is not 'idle' here
Btw, can you provide some reference stating that "idle loads cannot
trigger a data race and don't happen"? This statement is not clear to
me. I agree that an 'idle' load is not a visible side effect and can
optionally be omitted. But I'm not sure that 'idle' loads MUST not
happen at all, and thus can't trigger a data race.
Dmitriy V'jukov
That would be either g_something.store(g_something.load() + flag) or
g_something.add(flag), right?
Are you saying that such load(s) can be said to "keep loading" after
g_something.store() or .add() completes... thus creating a race in spite
of subsequent ssb (it orders .add() as well as .store())? Well, that
would be an "idle" part of sorta "prolonged loading" and thus we can
safely ignore it. ;-)
regards,
alexander.
Visible in a debugger? ;-)
>
> or this way:
> int flag;
> ...
> volatile int local = flag; // <--- load is not 'idle' here
In both cases the flag value will either be written to some memory
location or fetched and kept in a register without a spill to memory.
In the case of a spill to memory, the store's dependency on the load
will order the load with respect to the ssb and the dtor. Without a
spill to memory, the *implementation* would have to treat ssb as rel.
regards,
alexander.
First of all, Hans-J. Boehm doesn't agree with your reasoning -- see:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2176.html
(section "Why ordering constraints are never limited to loads or
stores" -> "Recipient writes to object")
Ok, let's play by your rules:
volatile int g_device_register;

class Obj
{
    int flag;

    ~Obj()
    {
        flag = 0; // (B)
    }

    Obj(Obj const& o)
    {
        flag = rand();
        if (flag)
            // load from the device register triggers some action
            // in the physical device
            g_device_register + 0;
    }
};
Dmitriy V'jukov
The key in his disagreement is that he totally ignores dependencies --
"we've argued separately that it doesn't make sense to enforce ordering
based on dependencies" (basing his argument on examples with fake
dependencies, load dependencies, etc., all missing the point). I'm
talking about stores dependent on loads. Not rocket science at all.
>
> Ok, let's play by your rules:
>
> volatile int g_device_register;
That just can't be a device register. For a device register you'd need
a pointer into memory-mapped register space, not an ordinary volatile
int object of static storage duration.
> class Obj
> {
>     int flag;
>
>     ~Obj()
>     {
>         flag = 0; // (B)
>     }
>
>     Obj(Obj const& o)
>     {
>         flag = rand();
>         if (flag)
>             // load from the device register triggers some action
>             // in the physical device
>             g_device_register + 0;
>     }
> };
There was a proposal in some C++ TR suggesting <iohw.h> and <hardware>
interfaces for g_device_register-style stuff. Go google it. That would
be the right place to specify "interspace" barriers between ordinary
memory and memory-mapped device space. As far as atomic<> is concerned,
memory-mapped device space simply doesn't exist. Next question. ;-)
regards,
alexander.