Dmitriy's MPMC Queue for Linux g++ >= 4.3

558 views
Skip to first unread message

Patrick F

unread,
Dec 29, 2011, 2:30:16 PM12/29/11
to Scalable Synchronization Algorithms
Dmitriy's MPMC Queue looks very promising, unfortunately its written
for Windows/MSVC.

Is there a modified version of Dmitriy MPMC Queue out there, that is
compatible with g++ and Linux?
If not, is there a similar performing MPMC Queue?

I've been experimenting with TBB concurrent_queue, but it does not
perform much better then a simple wrapped std::deque using mutex
locks.

Any hints?

If nobody has done this, i would be willing to try porting Dmitriy's
bounded MPMC queue to g++ compatible code.

Thanks in advance!

-------------
Patrick F.
P.S. Sorry for any spelling or grammar errors - I am not a native
speaker of english.

Dmitriy Vyukov

unread,
Dec 30, 2011, 2:19:26 AM12/30/11
to Scalable Synchronization Algorithms
On Dec 29, 10:30 pm, Patrick F <patrick.fl...@gmail.com> wrote:
> Dmitriy's MPMC Queue looks very promising, unfortunately its written
> for Windows/MSVC.
>
> Is there a modified version of Dmitriy MPMC Queue out there, that is
> compatible with g++ and Linux?
> If not, is there a similar performing MPMC Queue?
>
> I've been experimenting with TBB concurrent_queue, but it does not
> perform much better then a simple wrapped std::deque using mutex
> locks.
>
> Any hints?
>
> If nobody has done this, i would be willing to try porting Dmitriy's
> bounded MPMC queue to g++ compatible code.

Hi Patrick,

What exactly queue do you mean?
In general, all the same primitives are available on Linux/gcc.
gcc builtin atomics:
http://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html
__asm __volatile("" ::: "memory") for compiler fences.
pthread for threading primitives.
It should be quite trivial to port most sync algos Win<->Linux.

Patrick F

unread,
Dec 30, 2011, 8:15:08 PM12/30/11
to Scalable Synchronization Algorithms, Dmitriy Vyukov
Hi Dmitriy,
> What exactly queue do you mean?at first i took a look at your mpmc block queue i got from 1024cores.net.Im not sure if I understand the code correctly, but it only works for 32 Bits pointer types, correct?
In turn i took your code from the bounded queue also found at
1024cores.net, which was much easier to understand.You were right, the
porting to g++ compatible code was very straight forward.
Unfortunately the performance was not as hoped. The following shows
the results in comparison to the TBB concurrent queueand a simple
wrapped fifo queue using ordinary mutexes. The Queues are previously
filled with 1,000,000 (1 million) elements. The execution time (in
seconds) was measured to completely empty the queue via the dequeue
operation. So only dequeue operations are measured.
Threads  Dmitriy MPMC   Wrapped FIFO   TBB queue1        0.055961    
  0.051541       0.0572042        0.294067       0.635973      
0.1890954        0.313675       0.632279       0.2216298      
 0.410893       0.36722        0.323356
Are these results plausible? I hoped the bounded queue would be alot
faster ...
Thanks in advance! ------------- Patrick F. P.S. Sorry for any
spelling or grammar errors - I am not a native speaker of english.

Patrick F

unread,
Dec 30, 2011, 8:21:07 PM12/30/11
to Scalable Synchronization Algorithms
Sorry the formatting got all messed up, here is the table again:


Threads  Dmitriy MPMC   Wrapped FIFO   TBB queue

1        0.055961       0.051541       0.057204

2        0.294067       0.635973       0.189095

4        0.313675       0.632279       0.221629

8        0.410893       0.36722        0.323356


On Dec 31, 2:15 am, Patrick F <patrick.fl...@gmail.com> wrote:
> Hi Dmitriy,> What exactly queue do you mean?at first i took a look at your mpmc block queue i got from 1024cores.net.Im not sure if I understand the code correctly, but it only works for 32 Bits pointer types, correct?
>
> In turn i took your code from the bounded queue also found at
> 1024cores.net, which was much easier to understand.You were right, the
> porting to g++ compatible code was very straight forward.
> Unfortunately the performance was not as hoped. The following shows
> the results in comparison to the TBB concurrent queueand a simple
> wrapped fifo queue using ordinary mutexes. The Queues are previously
> filled with 1,000,000 (1 million) elements. The execution time (in
> seconds) was measured to completely empty the queue via the dequeue
> operation. So only dequeue operations are measured.
> Threads  Dmitriy MPMC   Wrapped FIFO   TBB queue1        0.055961
>   0.051541       0.0572042        0.294067       0.635973
> 0.1890954        0.313675       0.632279       0.2216298
>  0.410893       0.36722        0.323356
> Are these results plausible? I hoped the bounded queue would be alot
> faster ...
> Thanks in advance! ------------- Patrick F. P.S. Sorry for any
> spelling or grammar errors - I am not a native speaker of english.

Samy Al Bahra

unread,
Dec 30, 2011, 11:00:05 PM12/30/11
to lock...@googlegroups.com
Could you please share the source-code to your port and your benchmark?
Also, what mutex implementation are you using for the wrapped FIFO?


--
Samy Al Bahra
http://repnop.org/

Patrick F

unread,
Dec 31, 2011, 7:07:14 AM12/31/11
to Scalable Synchronization Algorithms
Hi Samy,

i'm using boost::lock for the wrapped FIFO and std::deque as the FIFO
implementation.

Let me strip down and clean up my code a bit before sharing, i can't
post it like this. Hope to have it by tomorrow.

Btw thanks for your fast reaction. :)

Edward Lam

unread,
Dec 31, 2011, 8:28:17 AM12/31/11
to lock...@googlegroups.com, Scalable Synchronization Algorithms
These results seem to show that TBB's concurrent_queue performs quite well compared to the others? The 8 thread case might be explained by hyperthreading? What hardware are you running on?

- Edward (iPhone)

Patrick F

unread,
Dec 31, 2011, 1:16:28 PM12/31/11
to Scalable Synchronization Algorithms, Edward Lam, sba...@repnop.org
Hi all,

the code is now available at http://dl.dropbox.com/u/7767139/queuetests.tar.gz
It requires that tbb and boost thread libraries are installed.
If that requirement is met, it should easily build for gcc version >=
4.4 with "make timequeues"

The seamed to be a small error in the last data for the FIFO having
the same results for both 2 and 4 threads.
Here is the data i collected with the current code (note: the ordering
of columns is now different)

#Threads Wrapped FIFO TBB Dmitriy MPMC

1 0.05082 0.04872 0.05588

2 0.21983 0.18456 0.29447

4 0.61581 0.21828 0.30902

8 0.36285 0.32872 0.40559



The machine this is running on is a 8 core Intel system with two Intel
Xeon 5345 2.33 GHz (Quad-Core) CPUs and a shared system memory of 16
GB.

hwloc-info shows that it's still a UMA system:

System(15GB)
Socket#0
L2(4096KB)
L1(32KB) + Core#0 + P#0
L1(32KB) + Core#1 + P#2
L2(4096KB)
L1(32KB) + Core#2 + P#4
L1(32KB) + Core#3 + P#6
Socket#1
L2(4096KB)
L1(32KB) + Core#0 + P#1
L1(32KB) + Core#1 + P#3
L2(4096KB)
L1(32KB) + Core#2 + P#5
L1(32KB) + Core#3 + P#7



Best Regards and a happy new year!

Patrick

Dmitriy V'jukov

unread,
Feb 19, 2012, 8:19:31 AM2/19/12
to lock...@googlegroups.com, Edward Lam, sba...@repnop.org
On Sat, Dec 31, 2011 at 9:16 PM, Patrick F <patric...@gmail.com> wrote:
> Hi all,
>
> the code is now available at http://dl.dropbox.com/u/7767139/queuetests.tar.gz
> It requires that tbb and boost thread libraries are installed.
> If that requirement is met, it should easily build for gcc version >=
> 4.4 with "make timequeues"
>
> The seamed to be a small error in the last data for the FIFO having
> the same results for both 2 and 4 threads.
> Here is the data i collected with the current code (note: the ordering
> of columns is now different)
>
> #Threads   Wrapped FIFO   TBB        Dmitriy MPMC
>
> 1          0.05082        0.04872    0.05588
>
> 2          0.21983        0.18456    0.29447
>
> 4          0.61581        0.21828    0.30902
>
> 8          0.36285        0.32872    0.40559


What are results with -DDO_WORK?
The less concurrency a thing allows the better it behaves in synthetic
benchmarks: more blocked threads -> less contention -> faster. The
absolute winner in the benchmark will be a single-threaded queue and a
single producer. But it's not an option for a real application, right?
We need real concurrency.

--
Dmitry Vyukov

All about lockfree/waitfree algorithms, multicore, scalability,
parallel computing and related topics:
http://www.1024cores.net

Reply all
Reply to author
Forward
0 new messages