Multi-producer/multi-consumer SEH-based queue

Dmitriy V'jukov

unread,

Jul 30, 2009, 7:57:30 PM7/30/09

to

Here is my multi-producer/single-consumer queue:
http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87acb8201
The interesting part of the algorithm is an XCHG-based producer part.

As Chris Thomasson correctly noted, the XCHG-based producer part can
be combined with the well-known CAS-based consumer part in order to
get multi-producer/multi-consumer (MPMC) queue:
http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/053e322ea90e4ad5

In general, one may combine different producer and consumer parts into
single queue provided that queue structure stays the same. For
example, it's possible to combine the XCHG-based producer part with
consumer part from 2-LOCK queue algorithm. The resulting 1-LOCK/1-XCHG
queue will have quite appealing characteristics (wait-free producers,
and 1 spin-lock acquisition for consumers, no need for ABA prevention
nor safe memory reclamation).

But what I'm going to show is a bit more interesting, it's a novel
consumer part for MPMC queue.
The problem with classical CAS-based MPMC queue is that both producers
and consumers may touch potentially freed/reused memory, moreover
producers may write to that memory. That's why it requires safe memory
reclamation (SMR), and in general SMR is quite problematic in a non-
managed non-kernel environment (C/C++).
XCHG-based producer part gracefully avoids touching freed/reused
memory. So now the problem is with consumer part only, but note that
consumers may only read from freed/reused (no writes to that memory).
The key point of the proposed algorithm is handling of reads from
reused memory with failing CAS, and handling of reads from freed
memory with SEH/signal handler.
Main characteristics of the algorithm:
- intrusive
- producers: 1 XCHG, wait-free
- consumers: 1 CAS on common path, mostly lock-free (***)
- producers and consumers do not content with each other (until queue
is empty)
- no need for safe memory reclamation

(***) requires additional comments. There is a small (1 machine
instruction in length) window of inconsistency for producers. If
producer will be preempted there he may (or may not) cause blocking of
consumers (other producers are still wait-free). If producer will be
terminated there he will cause system-wide stall. Taking into account
length of the window, probability of these things may be considered
negligible in most situations.

The algorithm requires double-word CAS (for pointer + ABA counter). On
64-bit systems it may be reduced to single-word (64-bit) CAS with
pointer packing technique. For example, on Intel64/Windows any aligned
pointer may be packed to 39 bits, this allows for 25-bit ABA counter.

OK, here we go:

/* Multi-producer/multi-consumer queue
* 2009, Dmitriy V'yukov
* Distributed under the terms of the GNU General Public License
* as published by the Free Software Foundation,
* either version 3 of the License,
* or (at your option) any later version.
* See: http://www.gnu.org/licenses
*/

// 32-bit, Windows, MSVC
#include <windows.h>
#include <intrin.h>

class mpmc_queue
{
public:
struct node_t
{
node_t* volatile next_;
};

mpmc_queue()
{
head_.ptr_ = 0;
head_.cnt_ = 0;
tail_ = &head_.ptr_;
}

~mpmc_queue()
{
ASSERT(head_.ptr_ == 0);
ASSERT(tail_ == &head_.ptr_);
}

void enqueue(node_t* node)
{
ASSERT(node);
node->next_ = 0;
node_t** prev = (node_t**)
_InterlockedExchange((long*)&tail_, (long)node);
ASSERT(prev);
// <--- the window of inconsistency is HERE (***)
prev[0] = node;
}

node_t* dequeue()
{
unsigned retry_count = 0;
retry:
__try
{
head_t h;
h.ptr_= head_.ptr_;
h.cnt_ = head_.cnt_;
for (;;)
{
node_t* n = h.ptr_;
if (n == 0)
return 0;
if (n->next_)
{
head_t xchg = {n->next_, h.cnt_ + 1};
__int64 prev_raw =
_InterlockedCompareExchange64
(&head_.whole_, xchg.whole_, h.whole_);
head_t prev = *(head_t*)&prev_raw;
if (*(__int64*)&prev == *(__int64*)&h)
return n;
h.ptr_ = prev.ptr_;
h.cnt_ = prev.cnt_;
}
else
{
node_t* t = (node_t*)tail_;
if (n != t)
{
// spinning here may only be caused
// by producer preempted in (***)
SwitchToThread();
h.ptr_= head_.ptr_;
h.cnt_ = head_.cnt_;
continue;
}
head_t xchg = {0, h.cnt_ + 1};
head_t prev;
prev.whole_ = _InterlockedCompareExchange64
(&head_.whole_, xchg.whole_, h.whole_);
if (prev.whole_ == h.whole_)
{
node_t* prev_tail = (node_t*)
_InterlockedCompareExchange
((long*)&tail_, (long)&head_.ptr_, (long)n);
if (prev_tail == n)
return n;
// spinning here may only be caused
// by producer preempted in (***)
while (n->next_ == 0)
SwitchToThread();
head_.ptr_ = n->next_;
return n;
}
h.ptr_ = prev.ptr_;
h.cnt_ = prev.cnt_;
}
}
}
__except ((GetExceptionCode() == EXCEPTION_ACCESS_VIOLATION
&& ++retry_count < 64*1024) ?
EXCEPTION_EXECUTE_HANDLER : EXCEPTION_CONTINUE_SEARCH)
{
goto retry;
}
}

private:
union head_t
{
struct
{
node_t* ptr_;
unsigned cnt_;
};
__int64 whole_;
};

head_t volatile head_;
char pad_ [64];
node_t* volatile* volatile tail_;

mpmc_queue(mpmc_queue const&);
mpmc_queue& operator = (mpmc_queue const&);
};

And here is a small test:

/* Multi-producer/multi-consumer queue
* 2009, Dmitriy V'yukov
* Distributed under the terms of the GNU General Public License
* as published by the Free Software Foundation,
* either version 3 of the License,
* or (at your option) any later version.
* See: http://www.gnu.org/licenses
*/

size_t const thread_count = 8;
size_t const batch_size = 32;
size_t const iter_count = 400000;

bool volatile g_start = 0;

struct my_node : mpmc_queue::node_t
{
int data;
char pad [64];
};

unsigned __stdcall thread_func(void* ctx)
{
mpmc_queue& queue = *(mpmc_queue*)ctx;

srand((unsigned)time(0) + GetCurrentThreadId());
size_t pause = rand() % 1000;

my_node* node_cache [batch_size];
for (size_t i = 0; i != batch_size; i += 1)
{
node_cache[i] = new my_node;
node_cache[i]->data = i;
}

while (g_start == 0)
SwitchToThread();

for (size_t i = 0; i != pause; i += 1)
_mm_pause();

for (int iter = 0; iter != iter_count; ++iter)
{
for (size_t i = 0; i != batch_size; i += 1)
{
queue.enqueue(node_cache[i]);
}
for (size_t i = 0; i != batch_size; i += 1)
{
for (;;)
{
my_node* node = (my_node*)queue.dequeue();
if (node)
{
node_cache[i] = node;
break;
}
SwitchToThread();
}
}
}

return 0;
}

int main()
{
mpmc_queue queue;

HANDLE threads [thread_count];
for (int i = 0; i != thread_count; ++i)
{
threads[i] = (HANDLE)_beginthreadex
(0, 0, thread_func, &queue, 0, 0);
}

Sleep(1);

unsigned __int64 start = __rdtsc();
g_start = 1;

WaitForMultipleObjects(thread_count, threads, 1, INFINITE);

unsigned __int64 end = __rdtsc();
unsigned __int64 time = end - start;
std::cout << "cycles/op=" << time /
batch_size * iter_count * 2 * thread_count)
<< std::endl;
}

--
Dmitriy V'yukov

Chris M. Thomasson

unread,

Jul 30, 2009, 8:39:20 PM7/30/09

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:8487febc-e1bd-4b3b...@w41g2000yqb.googlegroups.com...

> Here is my multi-producer/single-consumer queue:
> http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87acb8201
> The interesting part of the algorithm is an XCHG-based producer part.
>
> As Chris Thomasson correctly noted, the XCHG-based producer part can
> be combined with the well-known CAS-based consumer part in order to
> get multi-producer/multi-consumer (MPMC) queue:
> http://groups.google.ru/group/comp.programming.threads/browse_frm/thread/053e322ea90e4ad5

[...]

> node_t* dequeue()
> {
> unsigned retry_count = 0;
> retry:
> __try
> {
> head_t h;
> h.ptr_= head_.ptr_;
> h.cnt_ = head_.cnt_;

[...]

I am not sure about this yet but I think that you might need to load the ABA
counter _before_ the head pointer. I know that this is requirement for
lock-free stack when one uses normal CAS for producer side of the algorithm.
Here is how a such a lock-free stack can bite the dust if the ABA counter is
loaded _after_ the head pointer:
___________________________________________________________________
SC0: loop
SC1: head = lf->top
SC2: oc = lf->ocount
SC3: if head == NULL
SC4: return NULL
SC5: next = head->next
SC6: if DWCAS(&lf->top, <head, oc>, <next, oc + 1>)
SC7: break
SC8: endloop
SC9: return head

Think of current state being top = pointerA and oc = 1 and threadA loads
pointerA; get premepted by threadB which pops pointerA, thus increment the
oc from 1 to 2; ThreadA resumes and loads version 2 and gets to line SC5
thus reading NULL as a next pointer; gets preempted by threadB which pushes
pointerB, and then pointerA onto the list mutating the state to top =
pointerA and oc = 2 with pointerA having a next node equal to pointerB;
ThreadA resumes and performs a successful DWCAS operation when it should
have failed thus swapping the top of the list with NULL and all the nodes
get lost!
___________________________________________________________________

Not sure if this applies to the DWCAS portion of the MPMC queue.

Chris M. Thomasson

unread,

Jul 30, 2009, 8:45:19 PM7/30/09

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message
news:8487febc-e1bd-4b3b...@w41g2000yqb.googlegroups.com...

[...]

I don't worry about the (***) part as it's an EXTREMELY small window. Also,
if a programs threads can just up and die at (***) then, IMHO, the program
has a lot more "issues" to worry about than this one...

;^)

So, we now have intrusive/non-intrusive versions of a high-performance MPMC
queue that beats the shi% out of the Michael and Scott queue:

Very cool!

Dmitriy V'jukov

unread,

Jul 30, 2009, 8:48:57 PM7/30/09

to

On 31 июл, 04:39, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message

>
> news:8487febc-e1bd-4b3b...@w41g2000yqb.googlegroups.com...
>
> > Here is my multi-producer/single-consumer queue:

> >http://groups.google.ru/group/lock-free/browse_frm/thread/55df71b87ac...

> > The interesting part of the algorithm is an XCHG-based producer part.
>
> > As Chris Thomasson correctly noted, the XCHG-based producer part can
> > be combined with the well-known CAS-based consumer part in order to
> > get multi-producer/multi-consumer (MPMC) queue:

> >http://groups.google.ru/group/comp.programming.threads/browse_frm/thr...

I don't see now how this may break my algorithm. I think that Relacy
will show the truth.

--
Dmitriy V'jukov

Chris M. Thomasson

unread,

Jul 30, 2009, 9:07:53 PM7/30/09

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:c872ad32-6a06-4696...@r2g2000yqm.googlegroups.com...

On 31 июл, 04:39, "Chris M. Thomasson" <n...@spam.invalid> wrote:
> > "Dmitriy V'jukov" <dvyu...@gmail.com> wrote in message

[...]
> > I am not sure about this yet but I think that you might need to load the
> > ABA
> > counter _before_ the head pointer. I know that this is requirement for
> > lock-free stack when one uses normal CAS for producer side of the
> > algorithm.
> > Here is how a such a lock-free stack can bite the dust if the ABA
> > counter is
> > loaded _after_ the head pointer:
> > ___________________________________________________________________

[...]

> > ___________________________________________________________________
> >
> > Not sure if this applies to the DWCAS portion of the MPMC queue.

> I don't see now how this may break my algorithm.

I can't come up with anything yet. It's probably a complete false-alarm.

> I think that Relacy will show the truth.

I bet it will. I know that this is a problem for the lock-free stack. Humm,
perhaps I can see if Relacy can detect the error on the stack algorithm. Its
a very subtle race-condition, but it will destroy the stack and loose nodes.
Relacy should detect this as a memory leak. One problem, I am not sure how
to model this moment in Relacy. I need to create a special struct for a
DWCAS anchor in order to be compatible with rl::atomic. However, once I do
that, rl::atomic will load the entire anchor atomically with no chance of a
preemption in between the load to the anchors head pointer and the
subsequent load to the anchors aba counter. Am I making any sense to you?

Dmitriy V'jukov

unread,

Jul 30, 2009, 9:13:22 PM7/30/09

to

On 31 июл, 05:07, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > I think that Relacy will show the truth.
>
> I bet it will. I know that this is a problem for the lock-free stack. Humm,
> perhaps I can see if Relacy can detect the error on the stack algorithm. Its
> a very subtle race-condition, but it will destroy the stack and loose nodes.
> Relacy should detect this as a memory leak. One problem, I am not sure how
> to model this moment in Relacy. I need to create a special struct for a
> DWCAS anchor in order to be compatible with rl::atomic. However, once I do
> that, rl::atomic will load the entire anchor atomically with no chance of a
> preemption in between the load to the anchors head pointer and the
> subsequent load to the anchors aba counter. Am I making any sense to you?

C++0x atomics do not support non-atomic accesses to atomic variables.
Hmmm.... Well, I think, you may just load anchor twice :)
Load whole anchor, get pointer from it. Load whole anchor one more
time, get counter from it. It must be enough for Relacy to squeeze
into.

--
Dmitriy V'jukov

Dmitriy V'jukov

unread,

Jul 30, 2009, 9:19:31 PM7/30/09

to

On 31 июл, 04:45, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> I don't worry about the (***) part as it's an EXTREMELY small window. Also,
> if a programs threads can just up and die at (***) then, IMHO, the program
> has a lot more "issues" to worry about than this one...

Yes, spuriously dying threads are indeed problematic :)

That was just a comments on *formal* semantics. Theoretically, it's
not a lock-free queue because lock-free algorithm must be safe wrt
thread termination.
But from practical POV, yes, I do not care about (***) part too.

>
> ;^)
>
> So, we now have intrusive/non-intrusive versions of a high-performance MPMC
> queue that beats the shi% out of the Michael and Scott queue:
>
> Very cool!

Ah, I know what Michael and Scott will say, they will say "your queue
is not lock-free". Damn!

--
Dmitriy V'jukov

Chris M. Thomasson

unread,

Jul 30, 2009, 11:07:52 PM7/30/09

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:413440f3-a557-468d...@n11g2000yqb.googlegroups.com...

On 31 июл, 05:07, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > I think that Relacy will show the truth.

[...]

> > However, once I do
> > that, rl::atomic will load the entire anchor atomically with no chance
> > of a
> > preemption in between the load to the anchors head pointer and the
> > subsequent load to the anchors aba counter. Am I making any sense to
> > you?

> C++0x atomics do not support non-atomic accesses to atomic variables.

This raises another question... Lets say that a particular C++0x impl on a
32-bit system has lock-free access to a 64-bit variable. Will it have to use
64-bit atomic load instruction on the 32-bit system to load 64-bit anchor?
For instance, on IA-32, it would use a MOVQ instruction?

> Hmmm.... Well, I think, you may just load anchor twice :)
> Load whole anchor, get pointer from it. Load whole anchor one more
> time, get counter from it. It must be enough for Relacy to squeeze
> into.

Load anchor twice; something like:

void pop() {
for (;;) {
stack_anchor real;
stack_anchor a = m_anchor($).load(rl::memory_order_relaxed);
real.m_head = a.m_head;
// perhaps add an explicit yield?
stack_anchor b = m_anchor($).load(rl::memory_order_relaxed);
real.m_aba = b.m_aba;

// blah; perform CAS with `real' variable...
}
}

well, that should do it.

:^)

Thanks Dmitriy.

Chris M. Thomasson

unread,

Jul 31, 2009, 2:36:10 AM7/31/09

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:02c80e6f-4478-4e54...@j32g2000yqh.googlegroups.com...

Then we can become smart ass and say:

"This is just a very simple example of how clever lock-based queue can beat
the hell out of a crappy lock-free queue..."

Ouch!

Dmitriy V'jukov

unread,

Jul 31, 2009, 5:35:13 PM7/31/09

to

On 31 июл, 07:07, "Chris M. Thomasson" <n...@spam.invalid> wrote:

> > > > I think that Relacy will show the truth.
> [...]
> > > However, once I do
> > > that, rl::atomic will load the entire anchor atomically with no chance
> > > of a
> > > preemption in between the load to the anchors head pointer and the
> > > subsequent load to the anchors aba counter. Am I making any sense to
> > > you?
> > C++0x atomics do not support non-atomic accesses to atomic variables.
>
> This raises another question... Lets say that a particular C++0x impl on a
> 32-bit system has lock-free access to a 64-bit variable. Will it have to use
> 64-bit atomic load instruction on the 32-bit system to load 64-bit anchor?
> For instance, on IA-32, it would use a MOVQ instruction?

I think that the answer is yes.

--
Dmitriy V'jukov

Chris M. Thomasson

unread,

Aug 12, 2009, 9:21:11 PM8/12/09

to

"Dmitriy V'jukov" <dvy...@gmail.com> wrote in message

news:c872ad32-6a06-4696...@r2g2000yqm.googlegroups.com...

On 31 июл, 04:39, "Chris M. Thomasson" <n...@spam.invalid> wrote:
[...]
> I am not sure about this yet but I think that you might need to load the
> ABA
> counter _before_ the head pointer. I know that this is requirement for
> lock-free stack when one uses normal CAS for producer side of the
> algorithm.
> Here is how a such a lock-free stack can bite the dust if the ABA counter
> is
> loaded _after_ the head pointer:
> ___________________________________________________________________

[...]

>
> Think of current state being top = pointerA and oc = 1 and threadA loads
> pointerA; get premepted by threadB which pops pointerA, thus increment the
> oc from 1 to 2; ThreadA resumes and loads version 2 and gets to line SC5
> thus reading NULL as a next pointer; gets preempted by threadB which
> pushes
> pointerB, and then pointerA onto the list mutating the state to top =
> pointerA and oc = 2 with pointerA having a next node equal to pointerB;
> ThreadA resumes and performs a successful DWCAS operation when it should
> have failed thus swapping the top of the list with NULL and all the nodes
> get lost!
> ___________________________________________________________________
>
> Not sure if this applies to the DWCAS portion of the MPMC queue.

> I don't see now how this may break my algorithm. I think that Relacy
> will show the truth.

It does:

http://relacy.pastebin.com/f4b57bda2

Relacy says that the non-intrusive version of the algorithm I created is
indeed effected by reading the tail pointer before the aba counter. It leaks
memory, just like the stack does. Uncomment `TURN_OFF_RACER' macro to see
fixed version in action. Also, it shows interesting moment wrt memory
barriers. Sequential consistency is apparently needed between the push and
pop wrt setting the previous next pointer in push, and loading the next
pointer in pop.