Implementation of unbounded lock-free queue in C++

Johny

unread,

Sep 1, 2016, 6:36:56 AM9/1/16

to mechanical-sympathy

Hi,

I've implemented an unbounded lock-free queue MPMC in C++. It is placed here: https://github.com/anonim199935/queue/blob/master/LockFreeQueue.hpp

But, the problem is that it works slowly, even slower than naive lock implementation. I am searching a way to make it faster but I consider if is it possible to make it faster in significally way at whole?

As usually, the problematic is CAS instruction, especially high contetnion is a painful issue here. I manage a memory in manual way so every thread "taking" head or tail must increment it. Bacause of the high contention threads compete with, in CAS-loop there is a heave traffic on a bus ( I mean connection between cores/memory controller). what can lead to cache-misses.

I've tried use of _mm_pause but it didn't give a performance improvement. ( Perhaps, I used it wrongly). I've seen that boost implementation avoided incrementation but it doesn't manage memory manually- it uses something like pool. I tried to lookup a similar implementation but I found only a bounded implementation ( what can be simpler ( and faster) ).

Please suggest me something.

Thanks in advance.

Johny

Martin Thompson

unread,

Sep 1, 2016, 8:59:53 AM9/1/16

to mechanical-sympathy

This is one of the best bounded MPMC queue algorithms I've come across.

http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue

Johny

unread,

Sep 1, 2016, 12:25:27 PM9/1/16

to mechanical-sympathy

Thanks for the link.

I am aware that my implementation is poor. It was written for learnig purpose and just for fun. The given link is about bounded queue. Especially, I see that bounded ( array-based) implementation is just better ( cache-friendly, lack of memory managment and so on) but I've written unbounded one and I consider, is it possible to optimize it significally?

Avi Kivity

unread,

Sep 1, 2016, 3:07:22 PM9/1/16

to mechanica...@googlegroups.com

Consider having a thread that failed its cmpxchg spin on private memory, to avoid bouncing cachelines around. See MCS lock for inspiration.

On 09/01/2016 07:25 PM, Johny wrote:

Thanks for the link.

I am aware that my implementation is poor. It was written for learnig purpose and just for fun. The given link is about bounded queue. Especially, I see that bounded ( array-based) implementation is just better ( cache-friendly, lack of memory managment and so on) but I've written unbounded one and I consider, is it possible to optimize it significally?

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Johny

unread,

Sep 1, 2016, 3:26:37 PM9/1/16

to mechanical-sympathy

Thanks. I will consider it. However, the post is opened still.

Jean-Philippe BEMPEL

unread,

Sep 2, 2016, 9:47:35 AM9/2/16

to mechanical-sympathy

Hi Johny,

If it is for educational purpose it may be interesting to profile your benchmark with perf and with hardware counters to understand where the bottleneck is.

Cheers

Francesco Nigro

unread,

Sep 4, 2016, 12:12:50 PM9/4/16

to mechanical-sympathy

FastFlow has a very good implementation of unbounded queue and doing perf investigations is a very good idea for educational purposes too.
Anyway seems that overriding allocators (to regain the lost cache locality) and node caching are good means to improve the performances,but measuring is better than guessing, always...

Reply all

Reply to author

Forward