On my machine, with buffer pool:
message size: 10 [B]
message count: 10,000,000
mean throughput: 1,904,399.16 [msg/s]
mean throughput: 152.35 [Mb/s]
Without:
message size: 10 [B]
message count: 10,000,000
mean throughput: 3,037,667.07 [msg/s]
mean throughput: 243.01 [Mb/s]
That was interesting, but then I realized that the actual buffer sizes were minuscule.
With 10-byte buffers there isn't really anything for the GC to do, especially if the buffers are short-lived.
I then re-ran the test with 2KB messages and a count of 1 million.
Without:
message size: 2,048 [B]
message count: 1,000,000
mean throughput: 112,803.16 [msg/s]
mean throughput: 1,848.17 [Mb/s]
With:
message size: 2,048 [B]
message count: 1,000,000
mean throughput: 111,831.8 [msg/s]
mean throughput: 1,832.25 [Mb/s]
So now we are looking at effective parity.
I then tried it with 8,500 bytes.
With buffer pool:
message size: 8,500 [B]
message count: 1,000,000
mean throughput: 26,215.76 [msg/s]
mean throughput: 1,782.67 [Mb/s]
Without buffer pool:
message size: 8,500 [B]
message count: 1,000,000
mean throughput: 26,735.11 [msg/s]
mean throughput: 1,817.99 [Mb/s]
Then I realized that what you are actually doing is pretty basic: only a single buffer is checked out at any given time.
Given that pattern, I decided to write a truly stupid buffer manager, just to see whether it would work:
public static class BufferPool
{
    // One cached buffer per thread; no locking needed.
    [ThreadStatic] private static byte[] _buffer;

    public static byte[] Take(int size)
    {
        var buffer = _buffer;
        if (buffer == null || buffer.Length < size)
        {
            // Nothing cached (or the cached buffer is too small):
            // fall back to a fresh allocation.
            return new byte[size];
        }
        _buffer = null; // hand out the cached buffer
        return buffer;
    }

    public static void Return(byte[] buffer)
    {
        // Keep the returned buffer if the slot is empty or the new buffer
        // is larger than the cached one; otherwise let the GC have it.
        if (_buffer == null || buffer.Length > _buffer.Length)
        {
            _buffer = buffer;
        }
    }
}
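To make the intended check-out/return pattern concrete, here is a minimal sketch showing that a second Take reuses the array cached by Return (the pool class is repeated so the snippet is self-contained; the demo code around it is illustrative):

```csharp
using System;

var first = BufferPool.Take(2048);   // pool empty: freshly allocated
BufferPool.Return(first);            // cached for this thread
var second = BufferPool.Take(1024);  // smaller request: reuses the cached array

Console.WriteLine(ReferenceEquals(first, second)); // True
Console.WriteLine(second.Length);                  // 2048 - may be larger than requested

public static class BufferPool
{
    [ThreadStatic] private static byte[] _buffer;

    public static byte[] Take(int size)
    {
        var buffer = _buffer;
        if (buffer == null || buffer.Length < size)
            return new byte[size];
        _buffer = null;
        return buffer;
    }

    public static void Return(byte[] buffer)
    {
        if (_buffer == null || buffer.Length > _buffer.Length)
            _buffer = buffer;
    }
}
```

Note that, as with BufferManager, callers must be prepared to get back a buffer larger than they asked for.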
With it:
message size: 8,500 [B]
message count: 1,000,000
mean throughput: 28,314.17 [msg/s]
mean throughput: 1,925.36 [Mb/s]
message size: 2,048 [B]
message count: 1,000,000
mean throughput: 115,207.37 [msg/s]
mean throughput: 1,887.56 [Mb/s]
message size: 10 [B]
message count: 10,000,000
mean throughput: 2,912,904.17 [msg/s]
mean throughput: 233.03 [Mb/s]
What this looks like is that for small buffer sizes pooling doesn't matter, probably because the GC can allocate them fast enough that any extra complexity on that path is pure overhead.
But once we start talking about larger messages, it really matters.
I tried it with 256KB buffers, where it really matters (those are LOH allocations).
With my silly buffering code:
message size: 262,144 [B]
message count: 10,000
mean throughput: 1,857.36 [msg/s]
mean throughput: 3,895.16 [Mb/s]
Without the buffering code:
message size: 262,144 [B]
message count: 10,000
mean throughput: 1,737.62 [msg/s]
mean throughput: 3,644.05 [Mb/s]
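For context on why 256KB is the interesting case: arrays of 85,000 bytes or more are allocated on the Large Object Heap, which is only collected as part of gen 2, so allocating and discarding them is much more expensive than ordinary short-lived allocations. A quick way to observe this (the sizes here are just examples):

```csharp
using System;

var small = new byte[1000];    // ordinary small-object-heap allocation
var large = new byte[262144];  // the 256KB payload size used above - lands on the LOH

Console.WriteLine(GC.GetGeneration(small)); // 0 - fresh gen 0 object
Console.WriteLine(GC.GetGeneration(large)); // 2 - the LOH is reported as gen 2
```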
Now, note that your test is actually doing a single-threaded send.
I tried doing the send from 10 threads:
var tasks = new List<Task>();
for (int h = 0; h < 10; h++)
{
    tasks.Add(Task.Factory.StartNew(() =>
    {
        // Each thread gets its own push socket.
        var pushSocket = ZMQ.Socket(context, ZmqSocketType.Push);
        pushSocket.Connect(connectTo);
        for (int i = 0; i != messageCount / 10; i++)
        {
            var message = new Msg();
            message.InitPool(messageSize);
            pushSocket.Send(ref message, SendReceiveOptions.None);
            message.Close();
        }
        pushSocket.Close();
    }));
}
Task.WaitAll(tasks.ToArray());
That resulted in (without buffering):
message size: 262,144 [B]
message count: 10,000
mean throughput: 465.59 [msg/s]
mean throughput: 976.42 [Mb/s]
With silly buffering:
message size: 262,144 [B]
message count: 10,000
mean throughput: 1,048.44 [msg/s]
mean throughput: 2,198.73 [Mb/s]
With BufferManager:
message size: 262,144 [B]
message count: 10,000
mean throughput: 2,700.51 [msg/s]
mean throughput: 5,663.39 [Mb/s]
So here we are actually able to take advantage of the pooling, by reusing buffers. In the single-threaded run you don't see that at all, but in the multi-threaded scenario the results are far more dramatic.
I am not an expert in actual NetMQ usage scenarios, but I would expect this scenario (multiple concurrent sockets in a process) to be far more common than a single socket running on a single thread.
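For reference, the BufferManager runs above used System.ServiceModel.Channels.BufferManager. A sketch of the typical usage, in case anyone wants to reproduce this (the pool and buffer sizes here are illustrative, not the ones from my test):

```csharp
using System;
using System.ServiceModel.Channels;

// Illustrative limits: pool up to 16MB in total, individual buffers up to 256KB.
var manager = BufferManager.CreateBufferManager(
    maxBufferPoolSize: 16 * 1024 * 1024,
    maxBufferSize: 256 * 1024);

var buffer = manager.TakeBuffer(262144); // may return a buffer larger than requested
try
{
    // fill the buffer and hand it to the socket...
}
finally
{
    manager.ReturnBuffer(buffer); // put it back in the pool for reuse
}
```

Unlike my thread-static hack, BufferManager maintains internal pools that work across threads, which presumably is why it wins so clearly in the 10-thread run.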
thoughts?