http://www.ddj.com/hpc-high-performance-computing/208801974
in which the author claims to implement a queue using C++ std::list<>
that a single reader and a single writer can access simultaneously
without locking.
I understand the requirement that reading/writing std::list<T>::iterator
be atomic, but doesn't the implementation also require memory barriers
to prevent problems that might be caused by out-of-order writes on
multiprocessor machines?
--
Fran
Yes, the code is broken, and I've already corresponded at length with
the author and his main pre-pub reviewer (Petru and Andrei,
respectively) about the details.
Since the article has already sailed, Petru is going to post a patch
with as small a diff as possible to throw in some fences. But the code
really needs a major rewrite to:
a) Address atomicity: Don't use list::iterators, which are almost
never atomically readable and assignable in practice even on x86 --
hence the disclaimer 'check to see if they are' isn't enough.
b) Address ordering: Use (preferably) C++0x's std::atomic<T>s or
(for now) nonportable features like fences or VC++ 'volatile' which
was imbued with additional ordering semantics similar to what
std::atomic<T> will have.
This week I have to write my September DDJ article, and I'm probably
going to devote it to dissecting this code and exploring a number of
ways in which it's broken. The author graciously agreed to let me do
that -- Petru (like his main reviewer Andrei) is a good guy who has
written a number of solid articles before this one, so IMO he
shouldn't be judged too harshly on this misstep. Nevertheless, yes,
the code is broken and mustn't be allowed to stand as an example.
Herb
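For what such a rewrite might look like, here is a minimal sketch (my own illustration, not Petru's or Herb's actual code; class and member names are assumptions) of a single-producer/single-consumer queue that owns its own nodes and uses C++0x-style std::atomic<> with release/acquire ordering on the shared pointers:

```cpp
#include <atomic>
#include <cassert>

// Hedged sketch of an SPSC queue. The producer appends at the tail and
// lazily reclaims nodes the consumer has moved past; the consumer
// advances a 'divider' pointer. Only 'divider_' and 'last_' are shared.
template <typename T>
class SpscQueue {
    struct Node {
        T value;
        Node* next;
        explicit Node(const T& v) : value(v), next(nullptr) {}
        Node() : value(), next(nullptr) {}
    };
    Node* first_;                 // producer-only: head of the whole chain
    std::atomic<Node*> divider_;  // shared: consumed/unconsumed boundary
    std::atomic<Node*> last_;     // shared: last produced node
public:
    SpscQueue() {
        Node* dummy = new Node();  // queue starts with a dummy node
        first_ = dummy;
        divider_.store(dummy, std::memory_order_relaxed);
        last_.store(dummy, std::memory_order_relaxed);
    }
    ~SpscQueue() {
        while (first_) { Node* n = first_->next; delete first_; first_ = n; }
    }
    // Called only by the single producer thread.
    void produce(const T& v) {
        Node* n = new Node(v);
        last_.load(std::memory_order_relaxed)->next = n;  // link (private)
        last_.store(n, std::memory_order_release);        // publish node
        // Reclaim nodes the consumer has already moved past.
        while (first_ != divider_.load(std::memory_order_acquire)) {
            Node* tmp = first_;
            first_ = first_->next;
            delete tmp;
        }
    }
    // Called only by the single consumer thread.
    bool consume(T& out) {
        Node* div = divider_.load(std::memory_order_relaxed);
        if (div != last_.load(std::memory_order_acquire)) {  // nonempty
            out = div->next->value;
            divider_.store(div->next, std::memory_order_release);
            return true;
        }
        return false;
    }
};
```

The release store to last_ pairs with the consumer's acquire load, so a node's contents become visible before the node does; the same pairing in the other direction protects the producer's reclamation loop.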
IMO upcoming std::atomic<>'s ordering semantics suck badly. See
http://google.com/groups?as_umsgid=418AAF67...@web.de
(class mutex_and_condvar_free_single_producer_single_consumer)
for much better constraints.
Hth.
regards,
alexander.
IMO it would be better to implement the list itself (without using
std::list) within the LockFreeQueue class, so that it's standalone and
doesn't depend on the STL implementation. (It also allows some
optimizations in the list handling, e.g. recycling list item holders.)
So if we implement the list ourselves, and simply use 'volatile' for
the shared pointers, should this work (at least on x86) without
further complications?
> > b) Address ordering: Use (preferably) C++0x's std::atomic<T>s or
> > (for now) nonportable features like fences or VC++ 'volatile' which
> > was imbued with additional ordering semantics similar to what
> > std::atomic<T> will have.
>
> IMO upcoming std::atomic<>'s ordering semantics suck badly. See
>
> http://google.com/groups?as_umsgid=418AAF67.99C5F...@web.de
> (class mutex_and_condvar_free_single_producer_single_consumer)
>
> for much better constraints.
I think the implementation is buggy. It should be:
template<typename T,
         bool copy_ctor_or_dtor_can_mutate_object,
         bool copy_ctor_can_load_dst_object>
class mutex_and_condvar_free_single_producer_single_consumer {
  ...
  void producer(const T & value) {
    ELEM * tail = m_tail.load(msync::none);
    ELEM * next = advance(tail);
    while (next == m_head.load(
        type_list< msync::cchsb_t, msync::ccacq_t >::
            element<copy_ctor_can_load_dst_object>::type()))
      usleep(1000);
    new (tail) T(value);
    m_tail.store(next,
        type_list< msync::ssb_t, msync::rel_t >::
            element<copy_ctor_can_load_dst_object>::type());
  }
And even given an unbuggy implementation, the resulting code is very
fragile: every innocuous modification to the copy ctor or dtor can
break the synchronization in the queue.
This is why the upcoming std::atomic doesn't provide split hoist/sink
fences.
But I agree that it would be great if one were able to issue, for
example, 'load+hlb' or 'load+slb'.
Btw, in your proposal:
http://groups.google.com/group/comp.programming.threads/msg/b43cd6c9411c95b9
are you going to prohibit some combinations, for example 'load+rel'
or 'store+acq' (as the current C++0x draft does)?
Dmitriy V'jukov
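For reference, the current C++0x draft does draw that line: release is only meaningful on stores and acquire only on loads, and passing the wrong one violates the operation's precondition. A small illustration of my own (the commented-out lines are the draft-forbidden combinations):

```cpp
#include <atomic>

// Allowed and forbidden memory_order combinations per the C++0x draft.
int demo() {
    std::atomic<int> a(0);
    a.store(1, std::memory_order_release);      // store+rel: allowed
    int v = a.load(std::memory_order_acquire);  // load+acq: allowed
    // a.load(std::memory_order_release);       // load+rel: forbidden
    // a.store(2, std::memory_order_acquire);   // store+acq: forbidden
    return v;
}
```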
Given that the *tail "object" is actually undefined garbage at the point
of "new(tail) T(value)", and loads from it by the copy ctor (prior to
init/stores) may well trigger undefined behaviour anyway (even without
reordering), what is the point of preventing such reordering?
> This is why upcoming std::atomic doesn't provide crushed hoist/sink
> fences.
>
> But I agree, that it would be great, if one will be able to issue, for
> example, 'load+hlb' or 'load+slb'.
>
> Btw, in your proposal:
> http://groups.google.com/group/comp.programming.threads/msg/b43cd6c9411c95b9
> are you going to prohibit some combinations, for example, 'load+rel'
> or 'store+acq' (like it's in current C++0x draft)?
No.
regards,
alexander.
No, the load is not from garbage; it is from the object currently being
constructed:
class Obj
{
    int flag;

    ~Obj()
    {
        flag = 0; // (B)
    }

    Obj(Obj const& o)
    {
        flag = rand();
        if (flag) // (A)
            ...
    }
};
In your original code, a data race takes place on the 'flag' variable
between (A) and (B), because (A) does not happen-before (B): producer
contains only an ssb barrier, not a release. Hence UB.
Dmitriy V'jukov
Ah, that. Idle loads cannot trigger a data race. The model says that
they don't happen.
Keep in mind that "cc_hsb" is implied for ordinary conditional stores
(side effects) and hence non-idle flag load in (A) will be ordered with
respect to "..." store (side effect) and subsequent ssb barrier.
if (flag) some_store();
<ssb barrier>
is just fine and dandy.
regards,
alexander.
What do you mean by 'idle load'?
> Keep in mind that "cc_hsb" is implied for ordinary conditional stores
> (side effects) and hence non-idle flag load in (A) will be ordered with
> respect to "..." store (side effect) and subsequent ssb barrier.
>
> if (flag) some_store();
> <ssb barrier>
>
> is just fine and dandy.
I mean the ordering of the load of the 'flag' variable in the copy ctor
wrt the store to the 'flag' variable in the dtor. They are not ordered
by happens-before. So logically you get UB; physically, the load can
return the value stored in the dtor, not the value returned by rand().
Dmitriy V'jukov
Nope. You still need cpu memory barriers to ensure that the data being
accessed via the pointers is self-consistent. (Otherwise speculative
loads or cached stores could cause data to be seen out-of-order on other
cpus.)
Chris
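Chris's point can be made concrete. Publishing data through a shared pointer needs a release before the pointer store and an acquire after the pointer load; plain (non-VC++) volatile guarantees neither. A sketch of my own with explicit fences (the names are assumptions, not code from the article):

```cpp
#include <atomic>
#include <cassert>

struct Node { int payload; };
std::atomic<Node*> g_shared{nullptr};

// Producer side: initialize the payload, then publish the pointer.
// The release fence keeps the payload store (1) from sinking below the
// pointer store (2).
void producer(Node* n) {
    n->payload = 42;                               // (1) init data
    std::atomic_thread_fence(std::memory_order_release);
    g_shared.store(n, std::memory_order_relaxed);  // (2) publish
}

// Consumer side: the acquire fence keeps the payload load from being
// hoisted above the pointer load, so a non-null pointer implies the
// payload is fully visible.
int consumer() {
    Node* n;
    while ((n = g_shared.load(std::memory_order_relaxed)) == nullptr) { }
    std::atomic_thread_fence(std::memory_order_acquire);
    return n->payload;
}
```

Without the fences (or release/acquire on the pointer itself), speculative loads or buffered stores can let the consumer observe the pointer before the payload.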
Dmitriy V'jukov wrote:
>
> On Jul 7, 8:34 pm, Alexander Terekhov <terek...@web.de> wrote:
> > Dmitriy V'jukov wrote:
> >
> > [...]
> >
> >
> >
> > > class Obj
> > > {
> > >     int flag;
> > >
> > >     ~Obj()
> > >     {
> > >         flag = 0; // (B)
> > >     }
> > >
> > >     Obj(Obj const& o)
> > >     {
> > >         flag = rand();
> > >         if (flag) // (A)
> > >             ...
> > >     }
> > > };
> >
> > > In your original code, a data race takes place on the 'flag'
> > > variable between (A) and (B), because (A) does not happen-before
> > > (B): producer contains only an ssb barrier, not a release. Hence UB.
> >
> > Ah, that. Idle loads cannot trigger a data race. The model says that
> > they don't happen.
>
> What do you mean by 'idle load'?
For example
if (flag) ;
>
> > Keep in mind that "cc_hsb" is implied for ordinary conditional stores
> > (side effects) and hence non-idle flag load in (A) will be ordered with
> > respect to "..." store (side effect) and subsequent ssb barrier.
> >
> > if (flag) some_store();
> > <ssb barrier>
> >
> > is just fine and dandy.
>
> I mean the ordering of the load of the 'flag' variable in the copy ctor
> wrt the store to the 'flag' variable in the dtor. They are not ordered
> by happens-before. So logically you get UB; physically, the load can
> return the value stored in the dtor, not the value returned by rand().
The ssb above orders some_store() so that it happens before the dtor.
There is an implicit conditional ordering between some_store() and the
flag load, such that the flag load happens before the store, and hence
also before the dtor.
regards,
alexander.
> > I mean the ordering of the load of the 'flag' variable in the copy
> > ctor wrt the store to the 'flag' variable in the dtor. They are not
> > ordered by happens-before. So logically you get UB; physically, the
> > load can return the value stored in the dtor, not the value returned
> > by rand().
>
> The ssb above orders some_store() so that it happens before the dtor.
> There is an implicit conditional ordering between some_store() and the
> flag load, such that the flag load happens before the store, and hence
> also before the dtor.
Ok, bad example. Consider the following one instead:
atomic<int> g_something;

class Obj
{
    int flag;

    ~Obj()
    {
        flag = 0; // (B)
    }

    Obj(Obj const& o)
    {
        flag = rand();
        g_something += flag;
    }
};
Dmitriy V'jukov
Ok, I can rewrite the example this way:
volatile int flag;
...
if (flag); // <--- load of 'flag' is visible side effect here
or this way:
int flag;
...
volatile int local = flag; // <--- load is not 'idle' here
Btw, can you provide some reference stating that "idle loads cannot
trigger a data race and don't happen"? This statement is not clear to
me. I agree that an 'idle' load is not a visible side effect and can
optionally be omitted. But I'm not sure that 'idle' loads MUST not
happen at all, and thus can't trigger a data race.
Dmitriy V'jukov
That would be either g_something.store(g_something.load() + flag) or
g_something.add(flag), right?
Are you saying that such load(s) can be said to "keep loading" after
g_something.store() or .add() completes... thus creating a race in spite
of subsequent ssb (it orders .add() as well as .store())? Well, that
would be an "idle" part of sorta "prolonged loading" and thus we can
safely ignore it. ;-)
regards,
alexander.
Visible in a debugger? ;-)
>
> or this way:
> int flag;
> ...
> volatile int local = flag; // <--- load is not 'idle' here
In both cases the flag value will either be written to some memory
location or fetched and kept in a register without a spill to memory.
In the case of a spill to memory, the store's dependency on the load
will order the load with respect to the ssb and the dtor. Without a
spill to memory, the *implementation* would have to treat ssb as rel.
regards,
alexander.
First of all, Hans-J. Boehm doesn't agree with your reasoning -- see:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2176.html
(section "Why ordering constraints are never limited to loads or
stores" -> "Recipient writes to object")
Ok, let's play by your rules:
volatile int g_device_register;

class Obj
{
    int flag;

    ~Obj()
    {
        flag = 0; // (B)
    }

    Obj(Obj const& o)
    {
        flag = rand();
        if (flag)
            // load from the device register triggers some action
            // in the physical device
            g_device_register + 0;
    }
};
Dmitriy V'jukov
The key in his disagreement is that he totally ignores dependencies --
"we've argued separately that it doesn't make sense to enforce ordering
based on dependencies" (basing his argument on examples with fake
dependencies, load dependencies, etc., all missing the point). I'm
talking about stores dependent on loads. Not rocket science at all.
>
> Ok, let's play by your rules:
>
> volatile int g_device_register;
That just can't be a device register. For a device register you'd need
a pointer into memory-mapped register space, not an ordinary volatile
int object of static storage duration.
> class Obj
> {
>     int flag;
>
>     ~Obj()
>     {
>         flag = 0; // (B)
>     }
>
>     Obj(Obj const& o)
>     {
>         flag = rand();
>         if (flag)
>             // load from the device register triggers some action
>             // in the physical device
>             g_device_register + 0;
>     }
> };
There was a proposal in some C++ TR suggesting <iohw.h> and <hardware>
interfaces for g_device_register-style stuff. Go google it. That would
be the right place to specify "interspace" barriers between ordinary
memory and memory-mapped device space. As far as atomic<> is concerned,
memory-mapped device space simply doesn't exist. Next question. ;-)
regards,
alexander.