How is it that he is calling his queue the most efficient and fastest mpmc queue ?

aminer

unread,

Dec 15, 2011, 3:56:16 PM12/15/11

to

I have looked at the mpmc queue from Chris M. Thomasson ,
and i think that his is using the Bakery algorithm in his mpmc queue,
but since he is using the Bakery algorithm it's the same as using
critical sections ,
so why Chris M. Thomasson is saying that his algorithm is the most
efficient
algorithm and the fastest one ? perhaps i don't understand what Chris
is saying ?

can you explain please ?

thank you.

Amine Moulay Ramdane.

aminer

unread,

Dec 15, 2011, 4:18:42 PM12/15/11

to

But perhaps i am missing something, and i don't understand well his
algorithm...

So, can you explain correctly why it's the most efficient mpmc queue
and the fastest one...

Amine Moulay Ramdane.

aminer

unread,

Dec 15, 2011, 4:30:35 PM12/15/11

to

To be honest with you , i can not use his mpmc queue cause i don't
understand it
correctly, and also it must be well tested and he must prove it to us
formally that it's
working correctly, and also he said that his mpmc queue is the most
efficient and one of
the fastest one , so he must also prove it to us. But until then i can
not use his mpmc queue.

> > Amine Moulay Ramdane.- Hide quoted text -
>
> - Show quoted text -

aminer

unread,

Dec 15, 2011, 5:45:08 PM12/15/11

to

I have asked someone called Darek to translate the Chris M.
Thomasson's
mpmc queue to Delphi and Freepascal, but the answer was :

"But there is used other primitives: LOAD and STORE
http://en.wikipedia.org/wiki/Load-link/store-conditional
Not all architecture have it:" Darek

So, with wich primitives i have to replace Store() and Load() in the
x86 architecture ?

> > - Show quoted text -- Hide quoted text -

Chris M. Thomasson

unread,

Dec 16, 2011, 4:17:52 AM12/16/11

to

"aminer" <amin...@gmail.com> wrote in message
news:0dac4225-c70d-4d60...@da3g2000vbb.googlegroups.com...

>I have asked someone called Darek to translate the Chris M.
>Thomasson's
> mpmc queue to Delphi and Freepascal, but the answer was :

> "But there is used other primitives: LOAD and STORE
> http://en.wikipedia.org/wiki/Load-link/store-conditional
> Not all architecture have it:" Darek

> So, with wich primitives i have to replace Store() and Load() in the
> x86 architecture ?

I don't know Pascal very well; sorry about that. I am typing this in
directly in the newsreader, I believe it's some form of "Pascal"... I think
I totally forgot how pointers work:
_________________________________________________
{ Data Types }
type cell = record
ver:integer;
state:real;
end;

{ Global Variables/Prodecures }

{ N must be power of 2 }
var cells : array[0..(N - 1)] of cell;
var head:integer;
var tail:integer;

procedure init;
var i:integer;
begin
head := 0;
tail := 0;
for i := 0 to N do cells[i].ver := i;
end;

{ Atomic Ops / MASM impl }
ATOMIC_STORE PROC
mov ecx, [esp + 4]
mov eax, [esp + 8]
mov [ecx], eax
ret
ATOMIC_STORE ENDP

ATOMIC_LOAD PROC
mov ecx, [esp + 4]
mov eax, [ecx]
ret
ATOMIC_LOAD ENDP

ATOMIC_XADD PROC
mov ecx, [esp + 4]
mov eax, [esp + 8]
lock xadd [ecx], eax
ret
ATOMIC_XADD ENDP

{ Backoff/Blocking Code }
procedure backoff;
begin
{
use whatever for now (e.g., Sleep(1)).
The algorithm is in need of a distributed
conditional waiting algorihtm! Distributed
Eventcounts sound nice... ;^)
}
end;

{ Queue Ops }
procedure producer(state:real);
var ver:integer;
var c:^cell;
begin
ver := ATOMIC_XADD(@head, 1);
c := @cells[ver and (N - 1)];
while ATOMIC_LOAD(@c^.ver) != ver do backoff;
c^.state := state;
ATOMIC_STORE(@c^.ver, ver + 1);
end;

function consumer:real;
var ver:integer;
var c:^cell;
begin
ver := ATOMIC_XADD(@tail, 1);
c := @cells[ver and (N - 1)];
while ATOMIC_LOAD(@c^.ver) != ver + 1 do backoff;
consumer := c^.state;
ATOMIC_STORE(@c^.ver, ver + N);
end;
_________________________________________________

I hope that helps for now...

Chris M. Thomasson

unread,

Dec 16, 2011, 4:27:03 AM12/16/11

to

"Chris M. Thomasson" <cri...@charter.net> wrote in message
news:%0EGq.1020$Mw4...@newsfe06.iad...

OOPS! I think that should be:

for i := 0 to N - 1 do cells[i].ver := i;

;^/

> end;

Chris M. Thomasson

unread,

Dec 16, 2011, 5:02:42 AM12/16/11

to

"aminer" <amin...@gmail.com> wrote in message

news:8269e127-41ff-4089...@n6g2000vbg.googlegroups.com...

>
> I have looked at the mpmc queue from Chris M. Thomasson ,
> and i think that his is using the Bakery algorithm in his mpmc queue,
> but since he is using the Bakery algorithm it's the same as using
> critical sections ,

I am using a _distributed_ bakery algorithm... Please, examine the algorithm
very closely...

XADD = atomic fetch and add ; XADD on x86
LOAD = atomic load ; MOV on x86
STORE = atomic store ; MOV on x86

> so why Chris M. Thomasson is saying that his algorithm is the most
> efficient
> algorithm and the fastest one ? perhaps i don't understand what Chris
> is saying ?
>
> can you explain please ?

AFAICT, it's pretty good algorithm wrt existing MPMC bounded queues. One
single XADD on push/pop. This is the only way I can think of getting around
CAS! :^)

aminer

unread,

Dec 16, 2011, 10:26:08 AM12/16/11

to

Here is the code that i have changed , and when i run it gives a
runtime bug in the program,
can you please take a look at it and tell me what's the problem, is
the assembler code
correct ? i have translate the assembler code to the delphi basm ...

here is the code:

----

program test;

uses windows;

{ Data Types }
type cell = record
ver:integer;
state:real;
end;
{ Global Variables/Prodecures }
{ N must be power of 2 }

const N=16;

var cells : array[0..(N - 1)] of cell;
var head:integer;
var tail:integer;
procedure init;
var i:integer;
begin
head := 0;
tail := 0;

for i := 0 to N-1 do cells[i].ver := i;
end;

function ATOMIC_STORE(T: POINTER; C:integer ): integer;
assembler;stdcall;
asm
mov ecx, T
mov eax, C
mov [ecx], eax
end;
function ATOMIC_LOAD(T: POINTER): integer; assembler;stdcall;
asm
mov ecx, T
mov eax, [ecx]
end;
function ATOMIC_XADD(T: POINTER; C:integer ): integer;
assembler;stdcall;
asm
mov ecx, T
mov eax, C

lock xadd [ecx], eax

end;

{ Backoff/Blocking Code }
procedure backoff;
begin

sleep(1);

end;
{ Queue Ops }

procedure producer(state:real);
var ver:integer;
var c:^cell;
begin
ver := ATOMIC_XADD(@head, 1);
c := @cells[ver and (N - 1)];

while ATOMIC_LOAD(@c^.ver) <> ver do backoff;

c^.state := state;
ATOMIC_STORE(@c^.ver, ver + 1);
end;

function consumer:real;
var ver:integer;
var c:^cell;
begin
ver := ATOMIC_XADD(@tail, 1);
c := @cells[ver and (N - 1)];

while ATOMIC_LOAD(@c^.ver) <> ver + 1 do backoff;

consumer := c^.state;
ATOMIC_STORE(@c^.ver, ver + N);
end;

begin
producer(24.0);
producer(23.0);

writeln(consumer);

end.

aminer

unread,

Dec 16, 2011, 11:54:47 AM12/16/11

to

Hello,

Another question:

You are doing this on the consumer side:

ver := ATOMIC_XADD(@tail, 1);

How are you avoiding that tail is not going beyong head in your
algorithm ?

and you are doing on the producer side:

ver := ATOMIC_XADD(@head, 1);

How are you avoiding that head is not going beyong tail in your
algorithm ?

Regards,
Amine Moulay Ramdane.

Chris M. Thomasson

unread,

Dec 16, 2011, 2:51:37 PM12/16/11

to

"aminer" <amin...@gmail.com> wrote in message

news:b04b1d21-a39c-4394...@z25g2000vbs.googlegroups.com...

>
> Here is the code that i have changed , and when i run it gives a
> runtime bug in the program,
> can you please take a look at it and tell me what's the problem, is
> the assembler code
> correct ? i have translate the assembler code to the delphi basm ...

[...]

The following compiles and runs for me on freepascal:

http://pastebin.com/2yTjUJBH

I basically had to change all integer types to longword and use AT&T syntax
for the inline assembler. Anyway, I think I will work on a distributed
conditional waiting mechanism for this. That way, we can get rid of the
backoff and use "real" waiting primitives. ;^)

Chris M. Thomasson

unread,

Dec 16, 2011, 2:56:55 PM12/16/11

to

"aminer" <amin...@gmail.com> wrote in message

news:9c882317-7733-45b7...@i6g2000vbe.googlegroups.com...

> Another question:

> You are doing this on the consumer side:
> ver := ATOMIC_XADD(@tail, 1);

> How are you avoiding that tail is not going beyong head in your
> algorithm ?

The tail version is allowed to be greater than the head version.

> and you are doing on the producer side:
> ver := ATOMIC_XADD(@head, 1);

> How are you avoiding that head is not going beyong tail in your
> algorithm ?

The head version is allowed to be greater than the tail version.

It's the distributed bakery algorithms that makes everything actually work.

Now that I have a working version in FreePascal, I think you will quickly
catch on to exactly what I am doing here with the version numbers...

aminer

unread,

Dec 16, 2011, 3:50:38 PM12/16/11

to

On Dec 16, 2:51 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
> "aminer" <amine...@gmail.com> wrote in message

Can you remember me what is the FPC switch to compile the AT&T
assembler syntax ?

also is ther any problem to change the double type with and integer
type in
your algorithm so that i can test it against my lockfree_mpmc fifo
queue ?

Amine Moulay Ramdane.

aminer

unread,

Dec 16, 2011, 3:56:48 PM12/16/11

to

On Dec 16, 3:50 pm, aminer <amine...@gmail.com> wrote:
> On Dec 16, 2:51 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
>
>
>
>
>
> > "aminer" <amine...@gmail.com> wrote in message
>
> >news:b04b1d21-a39c-4394...@z25g2000vbs.googlegroups.com...
>
> > > Here is the code that i have changed , and when i run it gives a
> > > runtime bug in the program,
> > > can you please take a look at it and tell me what's the problem, is
> > > the assembler code
> > > correct ? i have translate the assembler code to the delphi basm ...
>
> > [...]
>
> > The following compiles and runs for me on freepascal:
>
> >http://pastebin.com/2yTjUJBH
>
> > I basically had to change all integer types to longword and use AT&T syntax
> > for the inline assembler. Anyway, I think I will work on a distributed
> > conditional waiting mechanism for this. That way, we can get rid of the
> > backoff and use "real" waiting primitives. ;^)
>
> Can you remember me what is the FPC switch to compile the AT&T
> assembler syntax ?

Just forget about the FPC switch for AT&T assembler.

>
> also is ther any problem to change the double type with and integer
> type in
> your algorithm so that i can test it against my lockfree_mpmc fifo
> queue ?
>

aminer

unread,

Dec 16, 2011, 4:42:12 PM12/16/11

to

Hello,

I have tested you queue with 2 threads, and it gives very bad
performance.

Can you please do your benchmarks with multiple threads and give us
the throuput ?

example my lockfree_mpmc on an Intel Core 2 Quad Q6600:with 4 cores
and 4 threads
under contention does give a throuput of:

38 millions of pop transactions/s

and

9 millions of push transactions/s

thank you.

Amine Moulay Ramdane.

> > - Show quoted text -- Hide quoted text -
>
> - Show quoted text -- Hide quoted text -

Chris M. Thomasson

unread,

Dec 17, 2011, 12:58:53 AM12/17/11

to

"aminer" <amin...@gmail.com> wrote in message

news:483dd461-9327-4622...@i8g2000vbh.googlegroups.com...

> I have tested you queue with 2 threads, and it gives very bad
> performance.

What was the size of N? I only have it set to 4 in my crude example. Also, I
did nothing to prevent false sharing simply because I don't know how to
explicitly pad to cache line and align on cache line boundary in Pascal.
Perhaps you can help me out with that... Something like this:

http://groups.google.com/group/comp.programming.threads/msg/f2b1cac59c8e98a3

how can I do that in Pascal?

Also, it really needs distributed conditional blocking mechanism.

> Can you please do your benchmarks with multiple threads and give us
> the throuput ?

I am going to implement it in C very soon and eliminate false sharing
wherever I can. I will post the code here. The only other algorithm I think
I can benchmark against is this beautiful algorithm invented by Dmitriy
Vyukov:

http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpmc-queue

IMVHO, it's the only one I can find that is comparable.

> example my lockfree_mpmc on an Intel Core 2 Quad Q6600:with 4 cores
> and 4 threads
> under contention does give a throuput of:

> 38 millions of pop transactions/s

> and

> 9 millions of push transactions/s

Can you show me the code?

aminer

unread,

Dec 17, 2011, 1:14:11 AM12/17/11

to

On Dec 17, 12:58 am, "Chris M. Thomasson" <cris...@charter.net> wrote:
> "aminer" <amine...@gmail.com> wrote in message

>
> news:483dd461-9327-4622...@i8g2000vbh.googlegroups.com...
>
> > I have tested you queue with 2 threads, and it gives very bad
> > performance.
>
> What was the size of N? I only have it set to 4 in my crude example. Also, I
> did nothing to prevent false sharing simply because I don't know how to
> explicitly pad to cache line and align on cache line boundary in Pascal.
> Perhaps you can help me out with that... Something like this:
>

> http://groups.google.com/group/comp.programming.threads/msg/f2b1cac59...

>
> how can I do that in Pascal?

You can do this for example:

type
typecache1 = array[0..15] of longword;

and do something like this:

var
tail:longword;
tmp1:typecache1;
head: longword;

>
> Also, it really needs distributed conditional blocking mechanism.
>
> > Can you please do your benchmarks with multiple threads and give us
> > the throuput ?
>
> I am going to implement it in C very soon and eliminate false sharing
> wherever I can. I will post the code here. The only other algorithm I think
> I can benchmark against is this beautiful algorithm invented by Dmitriy
> Vyukov:
>

> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpm...

>
> IMVHO, it's the only one I can find that is comparable.
>
> > example my lockfree_mpmc on an Intel Core 2 Quad Q6600:with 4 cores
> > and 4 threads
> > under contention does give a throuput of:
> > 38 millions of pop transactions/s
> > and
> > 9 millions of push transactions/s
>
> Can you show me the code?

Here is the code:

http://pages.videotron.com/aminer/zip/lockfree_queue.zip

Just compile the pop.pas and push.pas tests examples that you
will find inside the zip file....

Amine Moulay Ramdane.

Chris M. Thomasson

unread,

Dec 17, 2011, 1:07:27 PM12/17/11

to

"aminer" <amin...@gmail.com> wrote in message

news:27e050a5-efd9-49c2...@p9g2000vbb.googlegroups.com...

> On Dec 17, 12:58 am, "Chris M. Thomasson" <cris...@charter.net> wrote:
> > "aminer" <amine...@gmail.com> wrote in message
> >
> > news:483dd461-9327-4622...@i8g2000vbh.googlegroups.com...
> >
> > > I have tested you queue with 2 threads, and it gives very bad
> > > performance.
> >
> > What was the size of N? I only have it set to 4 in my crude example.
> > Also, I
> > did nothing to prevent false sharing simply because I don't know how to
> > explicitly pad to cache line and align on cache line boundary in Pascal.
> > Perhaps you can help me out with that... Something like this:
> >
> > http://groups.google.com/group/comp.programming.threads/msg/f2b1cac59...
> >
> > how can I do that in Pascal?

> You can do this for example:

> type
> typecache1 = array[0..15] of longword;

> and do something like this:

> var
> tail:longword;
> tmp1:typecache1;
> head: longword;

Okay. I can do something with that; thank you. However, I still am wondering
how to actually align data-structures on cache line boundaries. This is very
important because if the data-structure is padded but _not_ properly
aligned, you can still get false sharing. Basically, I want to know how to
perform pointer arithmetic such that I can get the queue aligned properly?

> Also, it really needs distributed conditional blocking mechanism.
>
> > Can you please do your benchmarks with multiple threads and give us
> > the throuput ?

I should have a crude C test program up by the end of today or the middle of
tomorrow. I will test against Dmitriy's awesome queue algorithm, and the
dual spinlock queue (e.g., the one that you are using).

> I am going to implement it in C very soon and eliminate false sharing
> wherever I can. I will post the code here. The only other algorithm I
> think
> I can benchmark against is this beautiful algorithm invented by Dmitriy
> Vyukov:
>
> http://www.1024cores.net/home/lock-free-algorithms/queues/bounded-mpm...
>
> IMVHO, it's the only one I can find that is comparable.
>
> > example my lockfree_mpmc on an Intel Core 2 Quad Q6600:with 4 cores
> > and 4 threads
> > under contention does give a throuput of:
> > 38 millions of pop transactions/s
> > and
> > 9 millions of push transactions/s
>
> Can you show me the code?

> Here is the code:

> http://pages.videotron.com/aminer/zip/lockfree_queue.zip

> Just compile the pop.pas and push.pas tests examples that you
> will find inside the zip file....

Sorry, but I cannot find my queue code in there. Can you please provide me
with the test you used for my algorithm specifically. I say "specifically"
simply because the pop and push tests you are using are _not_ compatible
with my mpmc queue interface. You seem to be expecting a boolean return
value indicating full/empty border conditions. My queue implicitly blocks on
such conditions. AFAICT, Pascal won't even compile my queue interface WRT
your test harness. BTW, I still need to know what the size of N was. I
cannot see my algorithm in the zip file, therefore, I cannot test my own
algorithm without altering your existing test code.

aminer

unread,

Dec 17, 2011, 3:01:30 PM12/17/11

to

I have used an object for that cause all the members are 32 bit
alligned inside an object.

> algorithm without altering your existing test code.- Hide quoted text -

I have noticed that when i push 100 integers it's working, but when i
push 5000 integers
it's not working.

Would you want me to show you the test? but the test is not working
correclty here.

Amine Moulay Ramdane

Chris M. Thomasson

unread,

Dec 17, 2011, 5:34:54 PM12/17/11

to

"aminer" <amin...@gmail.com> wrote in message

news:07cf9ccc-19b7-4708...@m7g2000vbc.googlegroups.com...

On Dec 17, 1:07 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
> "aminer" <amine...@gmail.com> wrote in message
>

No. I want the queue to be aligned on a cache line boundary. 32-bit is not
going to cut it. I can do this in C, not sure how to do it in Pascal.

Heck yeah I want to see the test. I know that the algorithm itself works
fine, I cannot see how pushing 5000 integers is not working unless you are
totally misusing my algorithm. Keep in mind that it implicitly provides
blocking for full/empty cases. I already know that the test code that I
downloaded is simply not compatible with the interface into my queue. So,
yes I need to see the code.

aminer

unread,

Dec 17, 2011, 5:56:46 PM12/17/11

to

> yes I need to see the code.- Hide quoted text -

Here is the code, just compiler mpmc.pas with the -Sd option under
freepascal,
and notice that it is not working correctly.

http://pages.videotron.com/aminer/zip/test1.zip

aminer

unread,

Dec 17, 2011, 5:59:58 PM12/17/11

to

I mean you have to compile it with the -Sd option , -Sd means the
delphi mode.

Chris M. Thomasson

unread,

Dec 17, 2011, 7:16:22 PM12/17/11

to

Ummm... you have N set at 10000000, which is _not_ a power of 2!

Also, try converting all of the integer types wrt the algorithm itself to
longwords.
Try 16777216 or something...

;^/

I will fix your Pascal code and post it here in a little while.

Chris M. Thomasson

unread,

Dec 17, 2011, 7:23:34 PM12/17/11

to

"Chris M. Thomasson" <cri...@charter.net> wrote in message

news:ghaHq.1569$Gb....@newsfe03.iad...

> Ummm... you have N set at 10000000, which is _not_ a power of 2!
>
>
> Also, try converting all of the integer types wrt the algorithm itself to
> longwords.
> Try 16777216 or something...
>
>
> ;^/
>

Also, wrt the bounded buffer! You have 4 threads pushing 100000 items
each... So the buffer has to be bigger than 400000 items or you will
deadlock my algorithm because it implicitly blocks on full/empty!

Chris M. Thomasson

unread,

Dec 17, 2011, 7:45:07 PM12/17/11

to

"Chris M. Thomasson" <cri...@charter.net> wrote in message
news:ghaHq.1569$Gb....@newsfe03.iad...

Also, please consider padding the cell records to a l2 cache line as well.
heck, pad the version counter along with the user state!

;^)

aminer

unread,

Dec 17, 2011, 7:50:57 PM12/17/11

to

On Dec 17, 7:23 pm, "Chris M. Thomasson" <cris...@charter.net> wrote:
> "Chris M. Thomasson" <cris...@charter.net> wrote in messagenews:ghaHq.1569$Gb....@newsfe03.iad...

>
> > Ummm... you have N set at 10000000, which is _not_ a power of 2!
>
> > Also, try converting all of the integer types wrt the algorithm itself to
> > longwords.
> > Try 16777216 or something...
>
> > ;^/
>
> Also, wrt the bounded buffer! You have 4 threads pushing 100000 items
> each... So the buffer has to be bigger than 400000 items or you will
> deadlock my algorithm because it implicitly blocks on full/empty!

Now it's working, and i have get the benchmarks on the push side, it
gives

7.4 millions push transactions/s and it does not scale with 4 threads
on four cores...

my other algorithm with two spinlocks is better , it has given me:

9,8 millions of push transactions/s

i will try to test the pop side, but i think it's not better than my
other two spinlocks algorithm.

Amine Moulay Ramdane.