Performance of the .NET version

Olivier Deheurles

Jul 6, 2011, 3:45:04 PM

to Disruptor-net

Matt and I spent some time today running performance tests with a
profiler.

From what I have seen the bottlenecks are located in
CacheLineStorageLong implementations.

The current implementation uses Thread.VolatileWrite and Thread.VolatileRead
since volatile long is not supported in .NET.

The advantage of these methods is that they guarantee atomicity on a
32-bit machine and prevent reordering (i.e. they apply fences).
But they are slow...

In the previous implementation I applied the fence manually:

Getter:
    var data = _data;
    Thread.MemoryBarrier();
    return data;

Setter:
    Thread.MemoryBarrier();
    _data = value;

This implementation seems to be faster than VolatileRead/VolatileWrite but is
safe only on a 64-bit machine (x64 seems to be safe; for IA-64 I have no
idea, to be honest).
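
For comparison, here is a minimal sketch of the manual-fence accessors next to a variant that stays atomic in a 32-bit process. The class and method names are hypothetical and this is not the actual CacheLineStorageLong code; Interlocked.Read and Interlocked.Exchange are swapped in here only as one possible 32-bit-safe alternative, and they are full-fence operations, so they are unlikely to be cheaper than the manual fence.

```csharp
using System;
using System.Threading;

// Hypothetical illustration only, not the Disruptor-net implementation.
public sealed class LongStorageSketch
{
    private long _data;

    // Manual fence, as in the getter above. Assumes a 64-bit process,
    // where an aligned 64-bit load is a single atomic operation.
    public long ReadWithFence()
    {
        long data = _data;
        Thread.MemoryBarrier();
        return data;
    }

    // Manual fence, as in the setter above. Same 64-bit assumption.
    public void WriteWithFence(long value)
    {
        Thread.MemoryBarrier();
        _data = value;
    }

    // Atomic 64-bit read, also in a 32-bit process.
    public long ReadAtomic()
    {
        return Interlocked.Read(ref _data);
    }

    // Atomic 64-bit write, also in a 32-bit process.
    public void WriteAtomic(long value)
    {
        Interlocked.Exchange(ref _data, value);
    }
}
```

Timing both pairs in the existing perf tests would show whether the atomic variant is an acceptable fallback for 32-bit builds.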

Now, regarding false sharing, I have not seen any proof of it yet, and I'm
still trying to find the best way to detect it.

If you have any experience troubleshooting this kind of problem, your
help is very welcome ;)
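
One rough way to look for false sharing is an A/B timing test: two threads each increment their own counter, once with the counters adjacent in memory and once with them padded apart. The sketch below assumes a 64-byte cache line; PaddedLong and FalseSharingProbe are hypothetical names, not types from Disruptor-net, and the JIT or the scheduler can easily distort a micro-benchmark like this, so treat any difference as a hint rather than proof.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading;

// Assumption: 64-byte cache lines. The value sits in the middle of a
// 128-byte struct, so two array elements cannot share a line.
[StructLayout(LayoutKind.Explicit, Size = 128)]
public struct PaddedLong
{
    [FieldOffset(64)] public long Value;
}

public static class FalseSharingProbe
{
    // Two threads increment neighbouring longs, which most likely
    // live on the same cache line. Returns elapsed milliseconds.
    public static long TimeAdjacent(int iterations)
    {
        long[] slots = new long[2];
        return Run(
            delegate { for (int i = 0; i < iterations; i++) slots[0]++; },
            delegate { for (int i = 0; i < iterations; i++) slots[1]++; });
    }

    // Same work, but the two values are 128 bytes apart.
    public static long TimePadded(int iterations)
    {
        PaddedLong[] slots = new PaddedLong[2];
        return Run(
            delegate { for (int i = 0; i < iterations; i++) slots[0].Value++; },
            delegate { for (int i = 0; i < iterations; i++) slots[1].Value++; });
    }

    // Starts both workers, waits for both, and reports the wall-clock time.
    private static long Run(ThreadStart first, ThreadStart second)
    {
        Thread a = new Thread(first);
        Thread b = new Thread(second);
        Stopwatch watch = Stopwatch.StartNew();
        a.Start();
        b.Start();
        a.Join();
        b.Join();
        return watch.ElapsedMilliseconds;
    }
}
```

If TimeAdjacent is consistently and clearly slower than TimePadded for the same iteration count, the two counters are probably contending for one cache line.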

Olivier

Tim Gebhardt

Jul 6, 2011, 6:13:01 PM

to disrup...@googlegroups.com

Silly question: Should there be an effort to support 32-bit machines?

If not, we could set the project to build only on x86-64 and support only that. And then we can go back to using the MemoryBarrier. I know personally that in the domain where I'll be using Disruptor-net I need 64 bits of addressing, so I couldn't care less about 32-bit support.

The one difference I can see between the old MemoryBarrier code and how VolatileRead and VolatileWrite are implemented is that there is a MethodImpl(NoInlining) attribute attached to the Thread methods. Would the MemoryBarrier method require this as well? Could this be the source of the slowdown, in that the old code was optimizing something away it shouldn't have?

Tim

When you look at the original InfoQ presentation, the LMAX guys said they create a ring buffer of size 22 million.

Olivier

Jul 6, 2011, 9:24:28 PM

to Disruptor-net

Process running in 64 bits mode.
UniCast1P1CPerfTest OpsPerSecond run 0: BlockingQueues=2 483 854,00, Disruptor=25 000 000,00, TPLDataFlow=4 551 661,00, QueueDisruptorRatio=x10,1
UniCast1P1CPerfTest OpsPerSecond run 1: BlockingQueues=2 487 562,00, Disruptor=26 109 660,00, TPLDataFlow=4 442 470,00, QueueDisruptorRatio=x10,5
UniCast1P1CPerfTest OpsPerSecond run 2: BlockingQueues=2 509 410,00, Disruptor=17 543 859,00, TPLDataFlow=4 480 286,00, QueueDisruptorRatio=x7,0

Perf is not too bad in 64-bit mode, but it's slower on my PC when running
in 32-bit mode.

Process running in 32 bits mode.
UniCast1P1CPerfTest OpsPerSecond run 0: BlockingQueues=2 139 037,00, Disruptor=14 471 780,00, TPLDataFlow=3 608 805,00, QueueDisruptorRatio=x6,8
UniCast1P1CPerfTest OpsPerSecond run 1: BlockingQueues=2 340 276,00, Disruptor=19 841 269,00, TPLDataFlow=3 889 537,00, QueueDisruptorRatio=x8,5
UniCast1P1CPerfTest OpsPerSecond run 2: BlockingQueues=2 427 184,00, Disruptor=11 534 025,00, TPLDataFlow=3 660 322,00, QueueDisruptorRatio=x4,8

Do you see the same results?

James Miles

Jul 6, 2011, 11:40:56 PM

to Disruptor-net

Olivier, this is the implementation of VolatileRead and VolatileWrite:

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static long VolatileRead(ref long address)
    {
        long num = address;
        MemoryBarrier();
        return num;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void VolatileWrite(ref long address, long value)
    {
        MemoryBarrier();
        address = value;
    }

Tim Gebhardt

Jul 6, 2011, 11:41:34 PM

to disrup...@googlegroups.com

Maybe I'm just reading this wrong or you switched your copy/paste, but I see you're getting better throughput with the 64-bit version.

On my own laptop against /trunk SVNr 63:

64-bit
UniCast1P1CPerfTest OpsPerSecond run 0: BlockingQueues=1,012,760.00, Disruptor=14,084,507.00, TPLDataFlow=1,762,735.00, QueueDisruptorRatio=x13.9
UniCast1P1CPerfTest OpsPerSecond run 1: BlockingQueues=1,267,266.00, Disruptor=10,917,030.00, TPLDataFlow=1,783,166.00, QueueDisruptorRatio=x8.6
UniCast1P1CPerfTest OpsPerSecond run 2: BlockingQueues=1,260,398.00, Disruptor=14,144,271.00, TPLDataFlow=1,784,758.00, QueueDisruptorRatio=x11.2

32-bit
UniCast1P1CPerfTest OpsPerSecond run 0: BlockingQueues=1,274,209.00, Disruptor=13,869,625.00, TPLDataFlow=1,611,603.00, QueueDisruptorRatio=x10.9
UniCast1P1CPerfTest OpsPerSecond run 1: BlockingQueues=1,262,148.00, Disruptor=14,184,397.00, TPLDataFlow=1,616,292.00, QueueDisruptorRatio=x11.2
UniCast1P1CPerfTest OpsPerSecond run 2: BlockingQueues=1,262,148.00, Disruptor=14,245,014.00, TPLDataFlow=1,787,629.00, QueueDisruptorRatio=x11.3

Not really that much of a difference, but I'm using a laptop that's a little less than stellar.

Tim

Olivier Deheurles

Jul 7, 2011, 3:26:55 AM

to disrup...@googlegroups.com, Disruptor-net

James,

OK, so there is no point in trying to apply the barriers manually in that case.

Olivier

James Miles

Jul 7, 2011, 3:29:11 AM

to Disruptor-net

Well I didn't say that ;)

It's complicated.

I'm thinking: do we really need this sequence number to be 64 bits?

Why not have a rotating 32-bit sequence number? It should work
provided the array capacity fits in 32 bits.

Olivier

Jul 7, 2011, 5:21:42 AM

to Disruptor-net

A lot of code is based on the fact that the sequence is always incrementing,
and Int32.MaxValue is a value that you can reach in real use cases - good
luck reaching Int64.MaxValue ;-)

It would probably require quite a lot of additional code to handle a
rotating sequence number, not sure we would save anything (and it
would add complexity).
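
To put numbers on that, a small sketch of the arithmetic follows. SequenceLifetime is a hypothetical helper; the rate used in the note below is the 25 000 000 ops per second figure from the 64-bit run 0 quoted earlier in this thread.

```csharp
using System;

// Hypothetical helper, only to illustrate the arithmetic.
public static class SequenceLifetime
{
    // Seconds until a sequence starting at 0 reaches maxValue
    // when it advances at opsPerSecond.
    public static double SecondsToReach(long maxValue, double opsPerSecond)
    {
        return maxValue / opsPerSecond;
    }

    // Same figure expressed in 365-day years.
    public static double YearsToReach(long maxValue, double opsPerSecond)
    {
        return SecondsToReach(maxValue, opsPerSecond) / (3600.0 * 24.0 * 365.0);
    }
}
```

At 25000000.0 ops per second, SecondsToReach(int.MaxValue, ...) comes out at roughly 86 seconds, while YearsToReach(long.MaxValue, ...) is more than 11,000 years.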

Martin Thompson

Jul 9, 2011, 9:24:42 PM

to Disruptor-net

Initially our sequence numbers were 32-bit and had code to cope with
the wrap which was fine. We moved to 64-bit because we started
archiving the sequences to a database for some things we audited. We
then decided for consistency to make them all 64-bit.
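
For readers wondering what coping with the wrap can look like, here is a minimal sketch of one common technique (comparison by signed difference). It is an illustration only, not the original LMAX code; WrappingSequence and its methods are hypothetical names, and the comparison is valid only while the two sequences are less than 2^31 apart. The index calculation assumes the ring size is a power of two.

```csharp
using System;

// Hypothetical illustration of wrap-aware 32-bit sequence handling.
public static class WrappingSequence
{
    // True when a comes before b, even if b has already wrapped
    // past Int32.MaxValue. The subtraction is allowed to overflow.
    public static bool IsBefore(int a, int b)
    {
        return unchecked(a - b) < 0;
    }

    // Maps a sequence to a slot. With a power-of-two ring size the
    // mask also works for negative (wrapped) sequence values.
    public static int IndexOf(int sequence, int ringSize)
    {
        return sequence & (ringSize - 1);
    }
}
```

Every place that compares two sequences with a plain less-than would have to go through something like IsBefore, which is the kind of extra code and complexity discussed above.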

Martin Thompson

Jul 9, 2011, 9:29:54 PM

to Disruptor-net

I'm not 100% sure on the C# semantics for the memory model, but are these
barriers in the right place? Should a read barrier not happen before
the read, and a write barrier not happen after the write, to ensure the
store and write-combining buffers are flushed? I might be missing something
with the "ref" qualifier?

Olivier

Jul 12, 2011, 8:07:23 AM

to Disruptor-net

Hi Martin,

I confirm what James said, the implementation of those methods in
the .NET Framework is (reflected code):

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static long VolatileRead(ref long address)
    {
        long num = address;
        MemoryBarrier();
        return num;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void VolatileWrite(ref long address, long value)
    {
        MemoryBarrier();
        address = value;
    }

The ref keyword just passes the argument by reference.

From the MSDN page on volatile fields
(http://msdn.microsoft.com/en-us/library/aa645755%28v=vs.71%29.aspx):
- A volatile read has "acquire semantics"; that is, it is guaranteed
to occur prior to any references to memory that occur after it in the
instruction sequence (this is why the barrier appears after the read).
- A volatile write has "release semantics"; that is, it is guaranteed
to happen after any memory references prior to the write instruction
in the instruction sequence (this is why the barrier appears before
the write).

Note that in .NET, a store of a volatile variable followed by a load
of another volatile variable can be reordered: you have to apply the
fence manually... tricky.
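
A minimal sketch of that last case, assuming two threads that each publish a flag and then check the other's (a Dekker-style handshake). StoreLoadSketch is a hypothetical class for illustration; the only point is where the explicit Thread.MemoryBarrier() goes.

```csharp
using System;
using System.Threading;

// Hypothetical illustration of a store followed by a load of a
// different volatile field, with a manual full fence in between.
public sealed class StoreLoadSketch
{
    private volatile int _flagA;
    private volatile int _flagB;

    // Called by thread A: announce intent, then check thread B.
    public bool TryEnterA()
    {
        _flagA = 1;              // volatile store (release)
        Thread.MemoryBarrier();  // keeps the load below from moving above the store
        return _flagB == 0;      // volatile load (acquire)
    }

    // Called by thread B: the mirror image of TryEnterA.
    public bool TryEnterB()
    {
        _flagB = 1;
        Thread.MemoryBarrier();
        return _flagA == 0;
    }
}
```

Without the two barriers, both loads could be satisfied before the other thread's store becomes visible, so both methods could return true at the same time.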

Gravitas

Aug 10, 2011, 6:10:04 AM

to Disruptor-net

On my machine, an Intel i7-2600, I'm getting ~15 million ops per
second, for UniCast1P1C.

None of the other values were below 11 million ops per second, and
batch was at 30 million ops per second.
