Conversant Disruptor


fredri...@scila.se

Feb 18, 2016, 4:23:08 AM
to mechanical-sympathy
Hello, we are evaluating the Conversant Disruptor as a drop-in replacement for other BlockingQueue implementations in multi-producer, multi-consumer scenarios where the number of queues, producers, and consumers is limited. In our tests the performance looks very promising, and there are also JMH tests available. Is anyone using this library? Is it production ready? (Despite the current version number, it doesn't seem to have been around for long.)

Martin Thompson

Feb 18, 2016, 11:48:28 AM
to mechanica...@googlegroups.com
Just reading the blurb on the website, I have to doubt the tests are measuring what they think they are. For example, they say they can transfer in a mean of 10ns, with some transfers taking 5ns. Given that the absolute minimum for a dirty cache hit between cores on the same socket is about 60 cycles on Intel CPUs, at 3 GHz a transfer would take at least 20ns.



--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dr Heinz M. Kabutz

Feb 18, 2016, 12:39:20 PM
to mechanica...@googlegroups.com
Agreed.  Maybe HotSpot is eliminating their microbenchmark?
--
Dr Heinz M. Kabutz (PhD CompSci)
Author of "The Java(tm) Specialists' Newsletter"
Sun/Oracle Java Champion
JavaOne Rockstar Speaker
http://www.javaspecialists.eu
Tel: +30 69 75 595 262
Skype: kabutz

Martin Thompson

Feb 18, 2016, 12:46:14 PM
to mechanica...@googlegroups.com
Or the cache subsystem is running at 12 GHz and CPU instructions like fences don't cost any cycles at all. :-)

When I get time I'll have a look out of curiosity. Even more often than HotSpot eliminating code, I see people assume that latency is 1 / throughput. It's a very common mistake.
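A toy calculation (illustrative numbers only, not taken from any benchmark in this thread) shows why inverting throughput understates latency whenever transfers are pipelined:

```java
// Illustrative numbers only: why latency != 1 / throughput.
// If each transfer takes 100 ns end to end but 10 transfers are in
// flight at once, throughput is 0.1 ops/ns -- and naively inverting
// it suggests a "latency" of 10 ns, 10x better than reality.
class LatencyVsThroughput {
    // throughput in ops/ns for a pipeline with the given depth
    static double throughput(double latencyNs, int inFlight) {
        return inFlight / latencyNs;
    }

    // the misleading figure obtained by inverting throughput
    static double naiveLatency(double throughputOpsPerNs) {
        return 1.0 / throughputOpsPerNs;
    }

    public static void main(String[] args) {
        double realLatencyNs = 100.0;
        double tput = throughput(realLatencyNs, 10);
        System.out.printf("throughput: %.2f ops/ns%n", tput);
        System.out.printf("1/throughput: %.1f ns, real latency: %.1f ns%n",
                naiveLatency(tput), realLatencyNs);
    }
}
```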

Kirti Teja Rao

Feb 18, 2016, 10:00:41 PM
to mechanica...@googlegroups.com
The tests seem to measure mean time to offer or mean time to poll rather than latency. I have also seen misleading numbers (for example, 1.5x better than they should be) when threads are not pinned to different cores while measuring queue latency or throughput.


John Cairns

Mar 4, 2016, 12:08:08 PM
to mechanical-sympathy
Hi Fred, 

Thanks for your interest in Conversant Disruptor! Conversant has been using it in production since 2012 and the performance is excellent. The BlockingQueue implementation is very stable, although we continue to tune and improve it. The latest release, 1.2.4, is 100% production ready.

Although we have been working on it for a long time, we decided to open source our BlockingQueue this year to contribute something back to the community.

Feel free to reach out to me directly via email if you have any questions about Conversant Disruptor or need any resources. As you say, it's a drop-in for BlockingQueue, so it's a very easy test. Conversant Disruptor will crush ArrayBlockingQueue and LinkedTransferQueue for thread-to-thread transfers.

In our system, we noticed a 10-20% reduction in overall system load and latency when we introduced it.

John Cairns

John Cairns

Mar 4, 2016, 12:33:47 PM
to mechanical-sympathy
Martin,  

Thanks sincerely for your reply and your skepticism. Feedback from some of the top minds in the field was certainly something we hoped for when we open sourced our code. The OP mentioned that he is looking for a multi-consumer, multi-producer queue. In that case we measure roughly 20-40ns transfers, versus 50ns or more for LMAX. I have also found that our queue performs better than both the Java BlockingQueue implementations and LMAX in pretty much every measurement we have done, not just the JMH benchmark that we open sourced. The OP stated that he is finding good performance in _his_ measurements. I'm not surprised that he is.

I think your skepticism is understandable, but I don't think it warrants dismissing Conversant Disruptor out of hand either. I'd encourage everyone to download the code and try it out. We open sourced our code so that everyone can benefit and contribute. http://bit.ly/D15ruptor

A nice feature of our implementation is that people can use this Disruptor as a drop-in replacement for a Java BlockingQueue, so even if in some scenarios Conversant has the same performance as LMAX, users don't have to change their code to incorporate our queue.

In the announcement blog post, I specifically pointed out that I don't think Conversant Disruptor is an alternative to, or in competition with, LMAX. I think they are two different approaches to the same problem. Some people might like the event model of the LMAX Disruptor; others might like the convenience of a Java BlockingQueue. I say tomato.

Here is that link in case you missed it: http://bit.ly/ConvDisr

John Cairns

Martin Thompson

Mar 4, 2016, 12:47:16 PM
to mechanica...@googlegroups.com
Hi John,

I love finding new approaches that can reduce latency and increase throughput. I'm skeptical when people make claims that I know are not possible on given hardware. I see a transfer as the exchange of a data item from one thread/process to another in a correct fashion. You are now claiming 20-40ns in this thread, when your website claimed 5-10ns, which is absolutely not possible on Intel CPUs. I wish CPUs could exchange data between cores this fast, but sadly they cannot :-)

To have a fair comparison, you could add a benchmark to this set of benchmarks, so that we are comparing apples with apples. It would be a very simple thing for you to do. If you have a great implementation then that will be wonderful.


I'd love to see what results you get.

Regards,
Martin...




Jahnsson Niklas

Mar 6, 2016, 3:44:36 AM
to mechanical-sympathy
Hey,

Had a brief look at the code for the Conversant disruptor, and I don't think that the MultithreadConcurrentQueue is publishing changes to the buffer in a thread-safe manner. After all, it could be that once you have claimed a position, another thread has not yet read from it, so it is not free for writing. Or am I missing something here? (John can probably explain what exactly is happening.) I also believe the same issue exists when reading a value, as somebody could be writing to it at the same time.

Also, compared to the Disruptor, the Conversant disruptor doesn't allocate memory up front when instantiating the queue; instead the objects are allocated by the producers, and the queue only contains references to them. I think this is just a different programming model from what is actually used in the Disruptor.

Best,
Niklas

Martin Thompson

Mar 6, 2016, 4:16:09 AM
to mechanica...@googlegroups.com
I had a look at the algorithm and it appears not to work in the multi-producer scenario, at least.

Take the following case.

Producing thread one claims sequence 1 with a CAS on tailCursor, then takes an interrupt before it has set the value or the tail. Let's assume the interrupt happens on the following line.

Producing thread two claims sequence 2 with a CAS on tailCursor, then writes into its slot and updates tail to sequence 2.

Now along comes a consumer: it sees tail at sequence 2 and claims sequence 1. It then reads the slot and updates the head to sequence 1. It returns null, as producing thread one is still interrupted.

Then producing thread one gets scheduled to run: it writes into the slot and updates the tail to 1. This causes a lost update and sets the tail to the wrong value, as 2 is now also undone.

A simple stress test should show that the invariants for this queue do not hold. Unless I've not had enough coffee this morning, I do not see how this FIFO is a correct implementation. I suspect in the real world it will work most of the time and occasionally lose items.

Regards,
Martin...



Anthony Maire

Mar 7, 2016, 4:04:02 AM
to mechanical-sympathy
Hi Martin,

First of all, since this is my first message on this group, let me thank you and all the regular members for the valuable information you share here; it has helped me a lot to better understand low-level stuff.

I think the case you are describing is not possible (unless I've not had enough coffee too): the CAS on tailCursor compares against the current value of tail, so producer 2 should not be able to pass the CAS until tail has been updated by producer 1.
The algorithm seems thread-safe to me at first sight, but I guess it may be less efficient under very high contention than algorithms where the only synchronization point between producers is the CAS itself, with no other instructions in between.
If it is indeed correct, a benchmark comparing it to other implementations might be interesting, since guessing does not prove anything.

Regards,
Anthony

Jahnsson Niklas

Mar 7, 2016, 4:56:00 AM
to mechanical-sympathy
Hey Anthony,

You write

"the CAS on tailCursor compares with tail current value, so producer 2 should not be able to pass the CAS until tail has been updated by producer 1."

So once producer 1 has passed the CAS and is interrupted, producer 2 can pass the CAS.

The example Martin gave is valid I think.

-niklas

Simone Bordet

Mar 7, 2016, 5:12:07 AM
to mechanica...@googlegroups.com
Hi,
Not to beat a dead horse, but poll() is broken in the same way, and the same goes for other methods such as remove() and size().

The broken idiom is trying to update 2 atomic fields while thinking that a successful CAS on one field atomically guards the other, but that is obviously not true, as Martin showed.

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz

Anthony Maire

Mar 7, 2016, 5:13:12 AM
to mechanical-sympathy
Let me make it clearer with an example: take a brand new instance, so tail = tailCursor = 0.

Producer 1 starts publishing and is interrupted just after the CAS, so tail is not updated: we have tailCursor = 1 and tail = 0.
Producer 2 reads tail (still 0) into tailSeq, computes tailNext = tailSeq + 1 (= 1), and tries tailCursor.CAS(tailSeq, tailNext), i.e. tailCursor.CAS(0, 1) => this fails, since tailCursor has already been set to 1 by producer 1's CAS.
Once producer 1 calls tail.lazySet() and the update becomes visible to producer 2, producer 2 will retry with tailCursor.CAS(1, 2) and will eventually pass the CAS.

Anthony
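For readers following along, the pattern Anthony describes can be sketched as below. This is a hedged reconstruction of the claim-then-publish idea under discussion, not the actual Conversant source; the poll side is shown single-consumer for brevity (the real class has a head cursor as well).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch (not the Conversant code) of the claim/publish pattern:
// the producer CAS on tailCursor is guarded by the *published* tail, so
// a second producer cannot claim a slot until the first producer has
// both written its slot and advanced tail.
class ClaimPublishQueue<E> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong tail = new AtomicLong();       // published
    private final AtomicLong tailCursor = new AtomicLong(); // claimed
    private final AtomicLong head = new AtomicLong();

    ClaimPublishQueue(int capacityPow2) { // capacity must be a power of two
        buffer = new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    boolean offer(E e) {
        for (;;) {
            long seq = tail.get();                 // read the *published* tail
            if (seq - head.get() >= buffer.length) return false; // full
            if (tailCursor.compareAndSet(seq, seq + 1)) {
                buffer[(int) (seq & mask)] = e;    // write the slot...
                tail.lazySet(seq + 1);             // ...then publish it
                return true;
            }
            // CAS failed: another producer holds the claim; retry once it
            // publishes. Note this makes producers blocking on each other,
            // as pointed out later in the thread.
        }
    }

    @SuppressWarnings("unchecked")
    E poll() { // single-consumer version, for brevity
        long seq = head.get();
        if (seq >= tail.get()) return null;        // empty
        E e = (E) buffer[(int) (seq & mask)];
        head.lazySet(seq + 1);
        return e;
    }
}
```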

Simone Bordet

Mar 7, 2016, 5:36:59 AM
to mechanica...@googlegroups.com
Hi,

On Mon, Mar 7, 2016 at 11:13 AM, Anthony Maire <maire....@gmail.com> wrote:
> Let's take make it clearer with an example : let's take a brand new
> instance, so tail = tailCursor = 0
>
> Producer 1 starts publishing and is interrupted just after the CAS, so tail
> is not updated, we have tailCursor=1 and tail=0

Right, you use the value of tail and not that of tailCursor to update tailCursor.
I missed that, so I take back what I said in my earlier email.

I'm still not sure it is right; e.g. I still think that size() may return negative values and that a concurrent poll() may interfere with remove(E[]), so I need to look at this in more detail.

Anthony Maire

Mar 7, 2016, 5:43:42 AM
to mechanica...@googlegroups.com
I did not use anything, since I'm not the author of the algorithm; I missed this subtle point when I first read it too ;) I'm not claiming to be 100% sure the whole class is right, since I have only read the offer/poll methods, but the point that was raised earlier seems fine to me.

However, I had a second look at the code, and there is something that does seem broken: the headCache / tailCache fields.
If 2 producers are trying to publish, one of them can modify the value while another reads it, and the reader can see an inconsistent value (cf. JLS 17.7), since it is a non-volatile 64-bit value and so writes to it are not atomic.
Maybe this can lead to a producer passing the "queue full" test where it shouldn't.
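The concern here is JLS 17.7: reads and writes of 64-bit fields are only guaranteed atomic when the field is declared volatile. One conventional repair, sketched below as an assumption rather than the library's actual fix, is to hold the cached value in an AtomicLong updated with lazySet:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the tearing hazard and one conventional fix (assumed, not
// the library's actual code). Per JLS 17.7, a non-volatile long may be
// written as two 32-bit halves, so a concurrent reader can observe a
// torn value on 32-bit JVMs:
//
//     long headCache;              // may tear between threads
//
// An AtomicLong with lazySet keeps the write cheap (no full fence)
// while making the value untearable:
class HeadCache {
    private final AtomicLong headCache = new AtomicLong();

    long read() {
        return headCache.get();
    }

    void update(long v) {
        headCache.lazySet(v); // ordered store, never torn
    }
}
```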

Martin Thompson

Mar 7, 2016, 5:53:11 AM
to mechanica...@googlegroups.com
Hi Anthony,

You are right. I missed that it re-reads the tail if the CAS fails, which makes my observation incorrect. It does, however, make the operation blocking between producers. Disruptor 1.x-2.x had the same issue, and it caused very bad latency outliers; this was addressed in Disruptor 3.0. If the thread that succeeds with the CAS is then interrupted before it can set the tail, all other producers cannot progress until it does: they are blocked.

The best thing to do is to write a stress test to see if the invariants hold.

Regards,
Martin...



Nitsan Wakart

Mar 7, 2016, 7:18:05 AM
to mechanica...@googlegroups.com
Disclaimer, I'm the main author of JCTools.
Having read the code (offer/poll/size), some observations:
- tail/headCache being plain loads/stores will lead to value tearing on 32-bit platforms, making the values nonsense. This is easy to fix by using the padded atomic variant and lazySet, or you can add a javadoc/release note that the classes only work on 64-bit JVMs.
- all the padded classes are only half padded, leaving them open to false sharing with data to their left
- size() is broken: it can return negative values.
The approach using the ref + cursor is interesting, but I think the D. Vyukov algorithm (implemented in JCTools' MPMC queue) is better: it removes the tailRef and replaces it with a per-slot sequence, which also allows improved performance in the relaxed offer/poll cases.
JCTools doesn't offer blocking queues at the moment; I've been too busy to push them from experimental to core, but the code should be usable and may offer an interesting option to people. If there's great demand I can push the inclusion of the blocking queues up my priority list.
Comparisons with the Disruptor as a queue miss the fact that it offers a range of features (object reuse, event broadcasting, sequential or parallel pipeline stages) missing from queues, which make it the best choice for the use cases where those features are required. Using the Disruptor as a generic queue is an anti-pattern IMO, and as such the comparison makes for a bit of a straw-man argument.
If you are looking for some JMH queue benchmarks measuring latency/throughput there's a reasonably well used and well reviewed set of benchmarks in JCTools.
Have fun storming the castle :-)
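The per-slot-sequence idea Nitsan refers to can be sketched roughly as follows. This is a hedged simplification of D. Vyukov's bounded MPMC algorithm; JCTools' real MpmcArrayQueue differs (padding, relaxed accessors, Unsafe), so treat this as an illustration only.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Simplified sketch of the Vyukov bounded MPMC queue: each slot carries
// its own sequence number, so producers and consumers coordinate through
// the slot itself rather than through a shared published-tail field.
class VyukovMpmcQueue<E> {
    private final AtomicReferenceArray<E> buffer;
    private final AtomicLongArray sequences;
    private final int mask;
    private final AtomicLong tail = new AtomicLong();
    private final AtomicLong head = new AtomicLong();

    VyukovMpmcQueue(int capacityPow2) { // capacity must be a power of two
        buffer = new AtomicReferenceArray<>(capacityPow2);
        sequences = new AtomicLongArray(capacityPow2);
        mask = capacityPow2 - 1;
        for (int i = 0; i < capacityPow2; i++) sequences.set(i, i);
    }

    boolean offer(E e) {
        for (;;) {
            long t = tail.get();
            int slot = (int) (t & mask);
            long seq = sequences.get(slot);
            if (seq == t) {                        // slot free for this lap
                if (tail.compareAndSet(t, t + 1)) {
                    buffer.set(slot, e);
                    sequences.set(slot, t + 1);    // publish to consumers
                    return true;
                }
            } else if (seq < t) {
                return false;                      // queue full
            } // else another producer advanced tail; retry
        }
    }

    E poll() {
        for (;;) {
            long h = head.get();
            int slot = (int) (h & mask);
            long seq = sequences.get(slot);
            if (seq == h + 1) {                    // element published
                if (head.compareAndSet(h, h + 1)) {
                    E e = buffer.get(slot);
                    buffer.set(slot, null);
                    sequences.set(slot, h + mask + 1); // free for next lap
                    return e;
                }
            } else if (seq < h + 1) {
                return null;                       // queue empty
            } // else another consumer advanced head; retry
        }
    }
}
```

Because each slot carries its own sequence, a producer stalled after its CAS only delays the consumer of that one slot, rather than blocking all other producers behind a shared published-tail field.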

Fredrik Lydén

Mar 7, 2016, 8:28:53 AM
to mechanica...@googlegroups.com
Thanks for all input! At least our stress tests of poll, offer, put, take and drain pass (the size method is not included in the tests).

Nitsan, +1 for including the blocking queue support from jctools-experimental in core. We are already using the SPSC and MPSC queue implementations from JCTools and they work excellently.

And I understand that there are use cases where the LMAX disruptor is a better choice, but in this case we needed a drop-in BlockingQueue implementation, preferably one that can be created in a Spring configuration.

Thanks,
Fredrik





--
CTO
Scila AB
Sveavägen 25, 8TR
111 34 Stockholm
Sweden

Direct: +46 8 546 402 92
Mobile: +46 73 707 30 11
fredri...@scila.se
www.scila.se

John Cairns

Mar 7, 2016, 1:06:50 PM
to mechanical-sympathy, nit...@yahoo.com
Nitsan, 

Thanks for the suggestions. I will pad both sides of the padded values and add a release note about this code being specialized for 64-bit hardware and JVMs.

Can you explain how size() is broken? tail >= head, therefore tail - head >= 0.

Thanks,
John

Nitsan Wakart

Mar 7, 2016, 2:24:43 PM
to John Cairns, mechanical-sympathy
size = tail - head only works as expected when the queue is 'inert'. Imagine:
T1: tailVar = tail; and suspend
T2,T3: offer/poll as many times as you like
T1: headVar = head;
tail is no longer >= head.
See JCTools for one way of solving it.
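Nitsan's interleaving can be replayed deterministically with plain variables standing in for the counters (a toy illustration, not queue code):

```java
// Deterministic replay of the interleaving described above. The
// observer reads tail, is "preempted" while 10 offer/poll pairs run,
// then reads head -- and tail - head comes out negative even though
// the queue stayed effectively empty the whole time.
class NaiveSizeDemo {
    static long tail = 5, head = 5; // 5 elements offered and polled so far

    static long naiveSize() {
        long tailVar = tail;  // observer reads tail = 5 ...
        tail += 10;           // ... 10 more offers happen while it is
        head += 10;           // preempted, matched by 10 polls ...
        long headVar = head;  // ... observer resumes and reads head = 15
        return tailVar - headVar; // 5 - 15 = -10
    }

    public static void main(String[] args) {
        System.out.println("size = " + naiveSize());
    }
}
```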

Vitaly Davidovich

Mar 7, 2016, 3:29:32 PM
to mechanical-sympathy
What's the rationale behind the (unbounded) looping in https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpmcArrayQueue.java#L221 to establish a stable snapshot? Why not just return Math.max(0, tail - head), where `tail' and `head' are captured into locals in whichever order yields the desired over/under estimation? The effort spent on getting a stable snapshot seems of questionable value, given that it (a) protects against preemption in a fairly narrow range of ops, (b) still yields just an estimate (through no fault of its own), and (c) can loop an indeterminate number of times (although it will terminate quickly in practice, in all likelihood). Given that size() in such collections can, at most, be used for monitoring/introspection, spending additional effort to improve its accuracy in the face of preemption doesn't seem worthwhile. What am I missing?
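For concreteness, the two strategies being compared look roughly like this (shapes assumed from the discussion, not JCTools' exact code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Two ways to implement size() over a head/tail counter pair.
// stableSize() re-reads head until it is unchanged around the tail
// read, guaranteeing the pair coexisted at some instant; clampedSize()
// accepts a possibly inconsistent pair and bounds the result at zero.
class SizeStrategies {
    final AtomicLong head = new AtomicLong(); // consumer index
    final AtomicLong tail = new AtomicLong(); // producer index

    int stableSize() {
        long after = head.get();
        for (;;) {
            long before = after;
            long currentTail = tail.get();
            after = head.get();
            if (before == after) {       // head did not move: stable snapshot
                return (int) (currentTail - after);
            }
        }
    }

    int clampedSize() {
        long t = tail.get();
        long h = head.get();             // may be newer than t
        return (int) Math.max(0, t - h); // clamp away negative artifacts
    }
}
```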


Nitsan Wakart

Mar 8, 2016, 11:13:43 AM
to mechanica...@googlegroups.com
Nothing; it's an effort at reporting a size which the queue, at least at some point near the present, actually had.
Is the effort worthwhile? As you point out, it's likely to exit the loop quickly enough in any real scenario where the thread executing size() is not interrupted midway, so the effort from a computational POV is not much, especially given that size() is a monitoring method, likely to be called from a less critical thread at fairly long intervals.
As for the implementation/complexity cost, it's done, and not so complex I think.

John Cairns

Mar 9, 2016, 3:17:29 PM
to mechanical-sympathy
I added Conversant Disruptor to your benchmark and it looks pretty favorable. I don't think your "burst = 1" case measures anything other than test overhead, but here are the burst = 100 results for 1 producer. Multiple producers also fare much better than "Disruptor".

Disruptor:

 Percentiles, ns/op:
     p(0.0000) =   2112.000 ns/op
     p(50.0000) =   2312.000 ns/op
     p(90.0000) =   4076.000 ns/op
     p(95.0000) =   4152.000 ns/op
     p(99.0000) =   4368.000 ns/op
     p(99.9000) =  12432.000 ns/op
     p(99.9900) =  41698.899 ns/op
     p(99.9990) = 3185660.969 ns/op

Conversant Disruptor

  Percentiles, ns/op:
     p(0.0000) =   1674.000 ns/op
     p(50.0000) =   1848.000 ns/op
     p(90.0000) =   2304.000 ns/op
     p(95.0000) =   3204.000 ns/op
     p(99.0000) =   3272.000 ns/op
     p(99.9000) =  10672.000 ns/op
     p(99.9900) =  17868.301 ns/op
     p(99.9990) = 486270.789 ns/op

I read 20-30ns, exactly on par with our benchmarking in the multithreaded case.

Here is the summary table. Conversant Disruptor fares pretty well in every case.

Benchmark                                       (burstLength)    Mode     Cnt      Score      Error  Units
DisruptorBenchmark.test1Producer                            1  sample  345870    134.733 ±    0.884  ns/op
DisruptorBenchmark.test1Producer                          100  sample  281436   2759.219 ±  156.416  ns/op
DisruptorBenchmark.test2Producers                           1  sample  642211    437.281 ±  366.231  ns/op
DisruptorBenchmark.test2Producers                         100  sample  570393  11048.353 ± 1798.365  ns/op
DisruptorBenchmark.test3Producers                           1  sample  730151    985.686 ±  483.954  ns/op
DisruptorBenchmark.test3Producers                         100  sample  650905  23529.651 ± 2739.922  ns/op
DisruptorBlockingQueueBenchmark.test1Producer               1  sample  313751    146.263 ±    1.033  ns/op
DisruptorBlockingQueueBenchmark.test1Producer             100  sample  378077   2146.201 ±  148.813  ns/op
DisruptorBlockingQueueBenchmark.test2Producers              1  sample  605957    529.000 ±  339.598  ns/op
DisruptorBlockingQueueBenchmark.test2Producers            100  sample  613663   6672.167 ± 1225.265  ns/op
DisruptorBlockingQueueBenchmark.test3Producers              1  sample  808469    893.126 ±  411.736  ns/op
DisruptorBlockingQueueBenchmark.test3Producers            100  sample  815679  14184.251 ± 1825.509  ns/op


Martin Thompson

Mar 9, 2016, 6:32:56 PM
to mechanica...@googlegroups.com
John,

Can you send a pull request for the benchmark you have written?

Martin...


Vitaly Davidovich

Mar 9, 2016, 8:09:49 PM
to mechanica...@googlegroups.com
Right, it's not complex but has a slight head scratching element to it as it looks to be achieving something pointless.  It's head scratching in that it's not the obvious implementation, which ought to make anyone reading the code pause (unnecessarily, IMO) for just a second.

A size of 0 coming from Math.max(0, delta) is just as fair an answer as the loop, because clearly size=0 occurred "at some point near the present". I've seen this paradigm before, and have always wondered why people go through the (admittedly little) effort of presenting something of no more value than the dirt-simple and obvious version via a slightly more involved construct.


--
Sent from my phone

Vitaly Davidovich

Mar 9, 2016, 8:22:41 PM
to mechanica...@googlegroups.com
I haven't looked at the disruptor benchmark suite, so the disclaimer is I may say something dumb, in which case please correct me.

As someone mentioned upthread, the Disruptor is also the storage backing the items exchanged (with a memory indirection to boot): publishing an item involves loading that storage and performing translation (copying). The Conversant disruptor is a queue where the caller has already prepared the object for publication. Does the benchmark account for that? If not, it's a bit of an apples-to-tomatoes comparison; that's not to take anything away from Conversant on its own, but we should likely either not compare them or call out this difference, IMO.


--
Sent from my phone

Benedict Elliott Smith

Mar 9, 2016, 8:37:55 PM
to mechanica...@googlegroups.com
Not disputing the sense of your approach, Vitaly; I tend to see little point in producing a stable result for something so ephemeral. But I'm pretty sure the calculation you provide doesn't guarantee that zero actually did occur, just that it may have done, i.e. that some interleaving of a subset of the mutations whose executions overlapped with the call to size could have produced the value you see (not that any such subset actually did, for any value, including zero).

Vitaly Davidovich

Mar 9, 2016, 8:47:05 PM
to mechanica...@googlegroups.com
Perhaps this is a bit philosophical, but 0 did occur if you're seeing a negative value now: at some point you took a snapshot of head (or tail), then got preempted. While preempted, the other side advanced on the number line, crossing 0 at some point; you just didn't see that point due to scheduling. But my point really is that this semantic difference, if any, is devoid of any real meaning for things like a concurrent size(). As you agree, *any* value returned is transient. The key is that it's not a bogus value (it's within the spec's range) and is not out of thin air (it's a real observable value, even if scheduling didn't let you see it precisely).

Nitsan Wakart

Mar 10, 2016, 1:42:20 AM
to mechanica...@googlegroups.com
0 didn't have to happen for you to see 0 or negative values.
You can have producer/consumer progress in lockstep and 'see' 0 or a negative value despite the fact that the queue size has remained effectively the same throughout the time of your observation.

Nitsan Wakart

Mar 10, 2016, 2:13:59 AM
to mechanica...@googlegroups.com
John,
The benchmark measures the latency of a burst of messages plus the cost of signaling back. This is close to 'latency' when you are sending 1 message. The cost of discovery/initial signalling is quite high. This is not the benchmark lying to you, nor benchmark overhead, when you are looking at a single message. Sending the first message is the most expensive case, as there's the least opportunity for amortizing costs (e.g. a cache miss on the producer index, a cache miss on the array element's cache line, etc.).

"I read 20-30ns, exactly on par with our benchmarking in the multithread case"
The cost of sending 100 messages is NOT something you divide by 100 to find the latency of sending 1 message. This is the same as putting 100 people on a bus and hoping each one will arrive in 1% of the time it takes the bus to get wherever it's going. Cost and latency are not the same thing. Consider the plain costs here: the producer will write out to the LLC and the consumer will read from the LLC; that alone, with no other overheads or contention, will be more than 30ns. I'm not sure what the lowest possible latency between cores is, but there's no software solution to hardware limitations.

"Benchmark                                       (burstLength)    Mode     Cnt      Score      Error  Units
DisruptorBenchmark.test1Producer                            1  sample  345870    134.733 ±    0.884  ns/op
DisruptorBenchmark.test1Producer                          100  sample  281436   2759.219 ±  156.416  ns/op
DisruptorBenchmark.test2Producers                           1  sample  642211    437.281 ±  366.231  ns/op
DisruptorBenchmark.test2Producers                         100  sample  570393  11048.353 ± 1798.365  ns/op
DisruptorBenchmark.test3Producers                           1  sample  730151    985.686 ±  483.954  ns/op
DisruptorBenchmark.test3Producers                         100  sample  650905  23529.651 ± 2739.922  ns/op
DisruptorBlockingQueueBenchmark.test1Producer               1  sample  313751    146.263 ±    1.033  ns/op
DisruptorBlockingQueueBenchmark.test1Producer             100  sample  378077   2146.201 ±  148.813  ns/op
DisruptorBlockingQueueBenchmark.test2Producers              1  sample  605957    529.000 ±  339.598  ns/op
DisruptorBlockingQueueBenchmark.test2Producers            100  sample  613663   6672.167 ± 1225.265  ns/op
DisruptorBlockingQueueBenchmark.test3Producers              1  sample  808469    893.126 ±  411.736  ns/op
DisruptorBlockingQueueBenchmark.test3Producers            100  sample  815679  14184.251 ± 1825.509  ns/op"

Assuming you have maintained some reasonable benchmarking hygiene and the results are correct (using taskset, disabling turbo, a quiet machine, etc.), you are looking at comparing MPMC to MPSC. This may sound like MPSC has an advantage, but in this benchmark the opposite is true. The benchmark has a consumer spinning on a queue and the producer putting in a burst of messages; the queue is empty to start off, so overheads for the first discovered elements are high. If the consumer is faster than the producer (which is easily achievable for MPSC), the queue will remain close to empty throughout the benchmark, leading to higher overheads. I have seen the same effect when comparing MPMC and MPSC queues in similar benchmarks, and similar effects when adding more producer threads. If you choose to read from that that MPMC is "faster" than MPSC when both are used as SPSC, that is up to you.





Vitaly Davidovich

Mar 10, 2016, 7:34:07 AM
to mechanica...@googlegroups.com
You can observe 0 with stable snapshots, just not negative values; the observation in size() is completely uncoordinated with producers and consumers, and their moving in lockstep has no bearing on what value an uncoordinated observer sees. To capture the "effectively the same size throughout the time" effect, or generally to make statistically significant inferences, collect enough samples.

This comes down to Quality of Implementation. We'll all agree that hard-coding a return of 0, while a valid value, would be poor QoI. My claim is that completely ignoring preemption between two physically adjacent memory reads is identical QoI to the stable snapshot.

Benedict Elliott Smith

unread,
Mar 10, 2016, 8:03:36 AM3/10/16
to mechanica...@googlegroups.com
What's your definition of uncoordinated here, exactly?  If the index values only increase, confirming that one of these values was the same either side of a measurement of the other value guarantees that the queue really did pass through a state representing that size at some arbitrary point during the execution of the method.  That seems pretty coordinated to me.  Of course, the value can change and be completely stale by the time the method exits, but that's not the same as reporting a value that never occurred.

Said differently: there are definitely sequences of mutations that would yield a zero (or arbitrarily large, depending on how you do it) value for your implementation, where the real value is - at all times - (close to) the opposite.

In many cases these kinds of attempts to provide a stable value really are meaningless - for instance, ConcurrentHashMap used to (iirc) try to achieve a stable size value across all segments, for which it is far harder to conceive of a meaningful semantic difference (it is possible, but vanishingly less likely, to see such a blatant misreport).  Here, too, it's likely the distinction isn't meaningful for most cases, but there are conceivably situations in which the distinction does matter.

Vitaly Davidovich

unread,
Mar 10, 2016, 8:22:46 AM3/10/16
to mechanical-sympathy
What's your definition of uncoordinated here, exactly?

Observer takes a measurement at a time of its own choosing without producer/consumer being aware of this.

If the index values only increase, confirming that one of these values was the same either side of a measurement of the other value guarantees that the queue really did pass through a state representing that size at some arbitrary point during the execution of the method.  That seems pretty coordinated to me.

That's not coordinated, just stable observation.  By uncoordinated, I mean an observer can always see 0 even with stable snapshots if they always happen to take an observation when consumer has caught up to producer.  There're absolutely no guarantees here.  If the intent is to not observe 0s all the time due to being unlucky, then collect more samples.

Said differently: there are definitely sequences of mutations that would yield a zero (or arbitrarily large, depending on how you do it) value for your implementation, where the real value is - at all times - (close to) the opposite.

As I mentioned, to yield a statistically significant measurement, collect more samples.  There's no such thing as "real value, at all times" when you're using an uncoordinated estimate/observation.  Producers and consumers are running at full speed (let's assume); there's no single "real value at all times here", just a statistically significant value.

Here, too, it's likely the distinction isn't meaningful for most cases, but there are conceivably situations in which the distinction does matter.

I'm not suggesting that getting stable snapshots in some cases are important, I'm specifically talking about this example (and similar code for identical purpose that I've seen elsewhere).  We can be paranoid about preemption in between reading head/tail, but then we should also be paranoid about the unbounded stable read loop not terminating either.  But I haven't heard anyone express concern about such loops.

Benedict Elliott Smith

unread,
Mar 10, 2016, 8:31:32 AM3/10/16
to mechanica...@googlegroups.com
I agree there's no such thing as "this is the current value" - but there is clearly a distinction between values that represent states that really were occupied by the structure (however briefly), and values that not only never were, but are up to infinity from the real value.

but then we should also be paranoid about the unbounded stable read loop not terminating either

Why does one follow from the other? Although I completely agree that an optimal solution to this would make clear the tradeoffs in the API and, for instance, accept a parameter indicating the maximum amount of misreport that's accepted (as opposed to the currently implicit zero or infinity, of the two proposed alternatives), as well as a maximum number of loops to execute, returning some no result value if that is exceeded.  That isn't what is being debated here.  The question is if the semantics are the same, and they clearly are not; one is not better than the other, they simply offer different behaviours.  

Like most such things in computing, most users of the API won't even know or care there is this distinction in behaviour, on either side of the coin.

Vitaly Davidovich

unread,
Mar 10, 2016, 9:16:48 AM3/10/16
to mechanical-sympathy
Why does one follow from the other?

One doesn't follow from the other, it was a tongue-in-cheek remark.  My point was some people decide that they should implement size() to avoid preemption between two physically adjacent instructions - highly unlikely scenario, especially unlikely to repeat over and over.  The tongue-in-cheek is: if you're paranoid about that aspect, why not be paranoid that your loop never terminates? Clearly, the person writing this assumes it impossible (not just unlikely, but impossible) for that to happen.

The question is if the semantics are the same, and they clearly are not; one is not better than the other, they simply offer different behaviours.

The semantics are different when looking at them in isolation.  Is it significant for the case of size() implemented (via the 2 alternatives) as discussed in this thread? My claim is no.

Nitsan Wakart

unread,
Mar 10, 2016, 9:23:56 AM3/10/16
to mechanica...@googlegroups.com
"My point was some people decide that they should implement size() to avoid preemption between two physically adjacent instructions - highly unlikely scenario, especially unlikely to repeat over and over."
I'll repeat my original comment here. This is a balance between accuracy and effort. Returning 0 always is low effort and low accuracy.
Returning a value that is negative or exceeds capacity is incorrect.
Returning the value you suggest is better accuracy and low effort.
Computing size as per JCTools is slightly higher effort for handling yet another edge case. Is it worth it? I dunno, it's done... not sure the philosophical debate is worth the effort, but what do I know ;-)


Vitaly Davidovich

unread,
Mar 10, 2016, 9:29:15 AM3/10/16
to mechanica...@googlegroups.com
Agreed.  I didn't intend for this subject to drag on for this long either.  But you know, we like to discuss the finer details of things on this list ... :)


On Thursday, March 10, 2016, 'Nitsan Wakart' via mechanical-sympathy <mechanica...@googlegroups.com> wrote:
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Benedict Elliott Smith

unread,
Mar 10, 2016, 9:36:29 AM3/10/16
to mechanica...@googlegroups.com
Well, there is one distinction, which is that if size is called often enough then eventually an execution will be preempted there, and we have a problematic value.  Whereas the loop's very act of not terminating continues to provide opportunities for it to terminate.  So the likelihood of it never terminating is really pretty much zero, as opposed to just extraordinarily unlikely.  But the real question is which of these unlikely events do you want to avoid: outliers in value or execution time?  One system may care about one, and not the other.

But I agree that expecting preemption to occur more than a handful of times is getting into beyond improbable territory, and simply reporting the most accurate value of a handful of loops would be more sensible.  Probably the best tradeoff (ignoring the cost of discussing this :) is a loop that progressively increases the amount of inaccuracy permitted, so that prompt termination is guaranteed and accuracy is still extremely likely.
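The progressively relaxing loop Benedict suggests could look something like the sketch below. The field names and the tolerance schedule are made up for illustration; this is not JCTools or Conversant code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a size() that retries for a stable snapshot but progressively
// relaxes the required accuracy, so prompt termination is guaranteed and an
// accurate value is still extremely likely. The index fields are stand-ins
// for a real queue's producer/consumer counters.
final class BoundedStableSize {
    final AtomicLong producerIndex = new AtomicLong();
    final AtomicLong consumerIndex = new AtomicLong();

    int size() {
        long after = consumerIndex.get();
        for (int tolerance = 0; ; tolerance++) {
            final long before = after;
            final long pIndex = producerIndex.get();
            after = consumerIndex.get();
            // accept the snapshot once the consumer index moved by no more
            // than the (growing) tolerance between the two reads
            if (after - before <= tolerance) {
                return (int) Math.max(0, pIndex - after);
            }
        }
    }
}
```

On the first pass this behaves exactly like the strict stable-snapshot loop (tolerance zero); each retry widens the accepted misreport by one element.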

Simon Thornington

unread,
Mar 10, 2016, 9:40:52 AM3/10/16
to mechanical-sympathy
It seems to me that though the two methods might be philosophically different, they are the same in that the usefulness of their value is the same.  The size is just a hint of the queue length at some point in the past.  Given that the "stronger" loopier size is no more useful, one does wonder "why bother?".  

John Cairns

unread,
Mar 10, 2016, 10:26:04 AM3/10/16
to mechanical-sympathy
If a fact is unknowable you can't know it even if you try to code around it very carefully.   In the spirit of quieting the doubters, I added the max(tail - head, 0) implementation.   But the truth is this solution is just as poor as any other attempt to "interpret the stable value."   In fact, it is deceptive because in returning seemingly well mannered values you falsely lead people to believe they can rely on the result of the size() calculation.

Most of the time tail - head will return a correct value.   In the one-in-a-million scenario posited by this preemption-gap situation, tail will essentially be a random variable.

What we know at the start:

tail <= head + queueSize

After some preemption delay you will have:

tail - head' <= queueSize

This means that the size() method can return any valid value including 0 and any negative value without any way to predict when it will or if it did.
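For concreteness, the max(tail - head, 0) implementation John mentions, extended with the capacity cap implied by Nitsan's definition of "broken", might be sketched as below. The names are illustrative, not Conversant's actual internals.

```java
// Clamp a raw (tail - head) read into [0, capacity]. Either index may be
// stale relative to the other, so the raw difference can be negative or
// exceed capacity; clamping makes the result well-mannered without making
// it any more trustworthy.
final class QueueSize {
    static int clampedSize(long tail, long head, int capacity) {
        long raw = tail - head;              // may be stale in either direction
        if (raw < 0) return 0;               // head read after tail went stale
        if (raw > capacity) return capacity; // tail read after head went stale
        return (int) raw;
    }
}
```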

When Nitsan called my implementation "broken" the funny part is that his implementation is equally broken but he didn't realize it.

There are only three technically correct solutions:

1.   Throw UnsupportedOperationException
2.   Synchronize versus state-change operations, at the cost of an order of magnitude in performance
3.   Increment an atomic size counter, at the cost of a factor of two in performance

I find all three of these to be unsatisfactory.   A Disruptor application should typically assume that size() is zero, it can use the "broken" size() implementation to validate but not to prove this hypothesis. 

Calling out one size() implementation over another is like saying you like the taste of peanut butter more than apple butter.    The emperor has no clothes. 

Benedict Elliott Smith

unread,
Mar 10, 2016, 10:38:52 AM3/10/16
to mechanica...@googlegroups.com
AFAICT 2 and 3 are semantically identical to looping for a stable value, and only differ in execution time for mutation/size.  So I would say that by your definition, Nitsan's approach is technically more correct than yours.

However, as far as I can tell all of the stated approaches are technically correct.  The only technically incorrect solution is one that is mistaken about its contract; there is no correct contract.



Vitaly Davidovich

unread,
Mar 10, 2016, 10:44:12 AM3/10/16
to mechanica...@googlegroups.com
Your size() wasn't broken, just unusual/unconventional.  But I don't understand - your queue implements your own interface, why even provide size() on it?

John Cairns

unread,
Mar 10, 2016, 12:46:05 PM3/10/16
to mechanical-sympathy, nit...@yahoo.com
Although I agree with you that there are significant advantages for the "Burst = 100" case, I don't really buy into the people-on-the-bus analogy. Is your "bus" a cache line? If so it only has 8 seats? More of a minivan, don't you think? Anyway, it is still limited by the processor interconnect like any other transfer. That interconnect puts a pretty strong constraint on the amount of advantage you get from a burst.

I do agree with you that the language around latency is not very good here. We should probably be talking mega-transfers/second, as is often done with CPU interconnects. In my measurement I'm seeing around 36MT/s. That is higher than any other Java BlockingQueue.

John

Martin Thompson

unread,
Mar 10, 2016, 2:08:37 PM3/10/16
to mechanical-sympathy
You posted benchmark results without publishing the benchmark so I wrote a benchmark to get some independently verifiable results.  My benchmark and full results are attached.

I ran my tests on a Linux 4.2 kernel and an Ivy Bridge Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz.

The figures you post below look like what I typically see on an Intel Xeon with Linux not configured for low latency. You will find my figures quite different.

Couple of observations I made when creating the benchmark:
- Other than ideas taken from the Disruptor codebase, the name Disruptor is confusing as it does not offer any of the Disruptor's features or API.
- The code does not implement the Queue interface so it is not a drop-in replacement as suggested.

From the performance tests I do not see a performance advantage over the LMAX Disruptor. The Conversant Disruptor is higher latency, and its latency is very unpredictable.

The burst length of one test measures a single round-trip transfer. If I compare to the baseline then the Conversant Disruptor gives an uncontended latency of ~150ns for a one-way transfer. This is NOT just test overhead. Profile to see where the time goes. Bursts of 100 with multiple producers measure contention.

When I look at the contended case things get more interesting. The error bar is larger than for other implementations I have measured. On reviewing the implementation, this is partially down to the blocking action I mentioned between the producers, but even more so due to the yielding and parking. The parking does not come back for over 50 microseconds: you think you are getting 50ns (hardcoded in your code) but in reality you are getting at least 50us + 50ns. This makes latency unpredictable and can take threads out of the race, so throughput can seem artificially inflated.

In a low-latency financial trading environment I'm clear about which implementation I would take.
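The parking point is easy to demonstrate with a rough, unscientific sketch (this is not the attached JMH benchmark) that times how long LockSupport.parkNanos(50) actually takes; concrete numbers vary by kernel, configuration and load.

```java
import java.util.concurrent.locks.LockSupport;

// Requesting a 50ns park does not give you a 50ns park: with no permit
// available the thread goes to the kernel, and timer granularity typically
// adds tens of microseconds before it comes back.
final class ParkCost {
    static long meanParkNanos(int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            LockSupport.parkNanos(50L);          // ask for 50ns
            total += System.nanoTime() - t0;     // observe what we got
        }
        return total / runs;
    }

    public static void main(String[] args) {
        meanParkNanos(1_000); // warm up
        System.out.println("requested 50ns, observed mean "
                + meanParkNanos(1_000) + "ns");
    }
}
```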

# JMH 1.11.3 (released 56 days ago)
# VM version: JDK 1.8.0_74, VM 25.74-b02
# VM invoker: /home/martin/opt/jdk1.8.0_74/jre/bin/java
# VM options: -Dagrona.disable.bounds.checks=true
# Warmup: 5 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 3 threads, will synchronize iterations
# Benchmark mode: Sampling time
# Benchmark: uk.co.real_logic.benchmarks.latency.ConversantDisruptorBenchmark.test3Producers
# Parameters: (burstLength = 100)

Histogram, ns/op:
[ 0.000, 1250000.000) = 2818800
[ 1250000.000, 2500000.000) = 9794
[ 2500000.000, 3750000.000) = 2367
[ 3750000.000, 5000000.000) = 611
[ 5000000.000, 6250000.000) = 219
[ 6250000.000, 7500000.000) = 50
[ 7500000.000, 8750000.000) = 8
[ 8750000.000, 10000000.000) = 3
[10000000.000, 11250000.000) = 1
[11250000.000, 12500000.000) = 0
[12500000.000, 13750000.000) = 1
[13750000.000, 15000000.000) = 0
[15000000.000, 16250000.000) = 0
[16250000.000, 17500000.000) = 0
[17500000.000, 18750000.000) = 0

Percentiles, ns/op:
p(0.0000) = 2796.000 ns/op
p(50.0000) = 7304.000 ns/op
p(90.0000) = 9408.000 ns/op
p(95.0000) = 13280.000 ns/op
p(99.0000) = 563200.000 ns/op
p(99.9000) = 2646016.000 ns/op
p(99.9900) = 4995600.384 ns/op
p(99.9990) = 6938987.315 ns/op
p(99.9999) = 9553906.564 ns/op
p(100.0000) = 13680640.000 ns/op


# JMH 1.11.3 (released 56 days ago)
# VM version: JDK 1.8.0_74, VM 25.74-b02
# VM invoker: /home/martin/opt/jdk1.8.0_74/jre/bin/java
# VM options: -Dagrona.disable.bounds.checks=true
# Warmup: 5 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 3 threads, will synchronize iterations
# Benchmark mode: Sampling time
# Benchmark: uk.co.real_logic.benchmarks.latency.DisruptorBenchmark.test3Producers
# Parameters: (burstLength = 100)

Histogram, ns/op:
[ 0.000, 2500.000) = 4084352
[ 2500.000, 5000.000) = 207
[ 5000.000, 7500.000) = 17
[ 7500.000, 10000.000) = 17
[10000.000, 12500.000) = 4
[12500.000, 15000.000) = 2
[15000.000, 17500.000) = 2
[17500.000, 20000.000) = 1
[20000.000, 22500.000) = 0
[22500.000, 25000.000) = 0
[25000.000, 27500.000) = 1

Percentiles, ns/op:
p(0.0000) = 139.000 ns/op
p(50.0000) = 286.000 ns/op
p(90.0000) = 418.000 ns/op
p(95.0000) = 500.000 ns/op
p(99.0000) = 670.000 ns/op
p(99.9000) = 875.000 ns/op
p(99.9900) = 1232.000 ns/op
p(99.9990) = 5187.695 ns/op
p(99.9999) = 15058.109 ns/op
p(100.0000) = 26016.000 ns/op

Benchmark                                    (burstLength)    Mode      Cnt      Score     Error  Units
ConversantDisruptorBenchmark.test1Producer               1  sample  1757757    243.276 ±   0.155  ns/op
ConversantDisruptorBenchmark.test1Producer             100  sample  1329366   6601.872 ±   1.642  ns/op
ConversantDisruptorBenchmark.test2Producers              1  sample  3081830    277.336 ±   0.110  ns/op
ConversantDisruptorBenchmark.test2Producers            100  sample  2765956  16047.311 ± 242.942  ns/op
ConversantDisruptorBenchmark.test3Producers              1  sample  3342357    366.687 ±   0.217  ns/op
ConversantDisruptorBenchmark.test3Producers            100  sample  2831854  25040.348 ± 340.575  ns/op
DisruptorBenchmark.test1Producer                         1  sample  1766823    230.556 ±   0.176  ns/op
DisruptorBenchmark.test1Producer                       100  sample  1600208   7240.488 ±   1.655  ns/op
DisruptorBenchmark.test2Producers                        1  sample  2954373    257.757 ±   0.105  ns/op
DisruptorBenchmark.test2Producers                      100  sample  2852658  15778.326 ±   3.778  ns/op
DisruptorBenchmark.test3Producers                        1  sample  4084603    310.108 ±   0.165  ns/op
DisruptorBenchmark.test3Producers                      100  sample  3043268  29028.470 ±  16.597  ns/op
human.txt
ConversantDisruptorBenchmark.patch

Benedict Elliott Smith

unread,
Mar 10, 2016, 2:39:48 PM3/10/16
to mechanica...@googlegroups.com
It's a bit of an unfair comparison to pit a non-parking disruptor with fewer threads than physical cores against a queue that parks.  The disruptor also fares badly with parking.

This seems very reminiscent of the conversation about size: the question is what are your goals, constraints and tradeoffs? No doubt, most people don't understand these things well enough to make an informed decision, so it can help to dispel misconceptions.  But it doesn't seem to me this analysis does that.

To presuppose everyone is in high frequency trading and has fewer threads than physical cores (and hence should be using a disruptor) is patently false.  In fact I've rarely seen anyone deploy the disruptor to such a setup - every instance I've encountered has used a blocking strategy for a queue that will regularly be exhausted.  The result is just making people feel good about having deployed the disruptor, not having derived any actual benefit therefrom.

Since the "conversant disruptor" has focused its measurements on peak transfer rates under saturation, I doubt it has properly exercised its behaviour in realistic blocking cases either.  But just focusing on this imbalanced comparison does not seem to advance collective understanding.

(I don't blame you for being annoyed about the naming though)


Martin Thompson

unread,
Mar 10, 2016, 3:05:27 PM3/10/16
to mechanica...@googlegroups.com
You are absolutely right on the tradeoffs and only a small proportion of people are in high frequency trading.

My main goal was to test the latency claims. I have a clear answer for contended and uncontended transfers between threads.

John Cairns

unread,
Mar 10, 2016, 6:57:11 PM3/10/16
to mechanical-sympathy
Hi,

You have a pull request on github as you requested.   Perhaps it was lost in your inbox?   

Out of curiosity what do you get when you set SpinPolicy.SPINNING?

John

Fredrik Lydén

unread,
Mar 11, 2016, 4:04:07 AM3/11/16
to mechanica...@googlegroups.com
Regarding drop-in replacement, only the BlockingQueue implementations (PushPullBlockingQueue and DisruptorBlockingQueue) also implement Queue, and in our specific use case this was extremely handy.

+1 that the name is a bit confusing though.




--
CTO
Scila AB
Sveavägen 25, 8TR
111 34 Stockholm
Sweden

Direct: +46 8 546 402 92
Mobile: +46 73 707 30 11
fredri...@scila.se
www.scila.se

Martin Thompson

unread,
Mar 11, 2016, 5:01:47 AM3/11/16
to mechanica...@googlegroups.com
I see you posted a new version of your library and a PR for the tests, but only after I had to write my own benchmark to verify the results you posted. This was too late. It is good practice to publish benchmarks at the same time you make claims about results so others can independently verify them. We have a name for that. It is called science :-)

At this stage I am moving on. For an MPSC queue I think the work by Dmitry Vyukov of www.1024cores.net fame provides better throughput and latency. Java ports of his C++ work can be found in Agrona and JCTools. These can be wrapped with a progressive backoff strategy if someone wants pseudo-blocking semantics. I don't see anything novel with the "Conversant Disruptor". I see a number of internal techniques copied from the early versions of the Disruptor applied to queue internals. Many developers mistakenly use the original Disruptor thinking it is a queue replacement. The Disruptor is useful when you require event pooling, a graph of dependent event processors, and batch API semantics. These features, which you call "over engineering" on your blog, are used to great effect by many who use the Disruptor appropriately, i.e. not just as a queue replacement.

The Disruptor has wait strategies that are swappable for those who need to choose between condition-variable-based blocking, progressive backoff, and spinning. Choices need to be made taking into account the requirements for CPU resource and the desired latency profile. Making the tradeoffs is a non-trivial subject and this thread is a great illustration to me of how far we still have to go as an industry. If anyone is interested in a good discussion on waiting and signalling techniques in the context of the various trade-offs then I'd enjoy being part of that.
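As a rough sketch of the kind of progressive backoff described here (a generic illustration, not the Disruptor's actual WaitStrategy API): spin briefly for the lowest latency, then yield the time slice, then park to stop burning CPU.

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Generic progressive-backoff wait loop: the spin/yield thresholds and the
// park duration are arbitrary illustrative values; a real strategy would
// tune them for the target CPU budget and latency profile.
final class BackoffIdle {
    static void awaitTrue(BooleanSupplier condition) {
        int spins = 0;
        while (!condition.getAsBoolean()) {
            if (spins < 100) {
                Thread.onSpinWait();            // busy spin: lowest latency
            } else if (spins < 200) {
                Thread.yield();                 // give up the time slice
            } else {
                LockSupport.parkNanos(1_000L);  // back off: lowest CPU burn
            }
            spins++;
        }
    }
}
```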

Martin...




Nitsan Wakart

unread,
Mar 11, 2016, 5:21:59 AM3/11/16
to mechanica...@googlegroups.com
"When Nitsan called my implementation "broken" the funny part is that his implementation is equally broken but he didn't realize it."
IMO: size() method which returns a value that is < 0 or > capacity is broken.
Your implementation will do so under certain conditions; it was therefore, by that definition, broken. You don't have to agree with the definition.
As for degrees of accuracy of the implementation, I think we went over it. Take your pick and be merry.
I'm not sure what I missed in terms of broken implementation, but I am delighted to have entertained you and anyone else who got the joke.
"The emperor has no clothes."
I must concede that I do most of my coding naked and therefore if I am the emperor you are on the money.

Martin Thompson

unread,
Mar 11, 2016, 5:30:28 AM3/11/16
to mechanical-sympathy
Correction.

MPSC -> MPMC.

Nitsan Wakart

unread,
Mar 11, 2016, 5:35:42 AM3/11/16
to mechanica...@googlegroups.com
"I don't really buy into the people on the bus analogy." The analogy is there to highlight the difference between latency and throughput. Since you later say: "I do agree with you that the language around latency is not very good here." I thought you got the point, but then I read onward to "We should probably talking mega-transfers/second as is often done with CPU interconnects" and lost hope.
You repeatedly took the "time to send 100 messages and see a response" measure as something you can divide to get a single-message latency. That is a plain mistake. Or 'not very good language' if you feel like downplaying the mistake. It's not correct, is what I'm trying to get across. We should not be talking about mega-transfers; we already have a perfectly workable measurement on the application level, but you can't derive a single-message latency out of it.

ki...@kodewerk.com

unread,
Mar 11, 2016, 8:42:57 AM3/11/16
to mechanica...@googlegroups.com
Guys,

This is a tech discussion, and if any implementation is “broken” it should be established in fact (a test is an excellent way to establish a fact) and the author should take this as a signal that they need to fix something. I’m interested in hearing about what a fix might look like, not what Nitsan coding in the nude might look like.

Regards,
Kirk


Nitsan Wakart

unread,
Mar 11, 2016, 10:08:59 AM3/11/16
to mechanica...@googlegroups.com
Absolutely Kirk, thank you for correcting the course of discussion. Here's a test:

    @Test
    public void testSize() throws Exception {
        assumeThat(spec.isBounded(), is(true));
        final AtomicBoolean stop = new AtomicBoolean();
        final Queue<Integer> q = queue;
        final Val fail = new Val();
        Thread t1 = new Thread(new Runnable() {
            @Override
            public void run() {
                while (!stop.get()) {
                    q.offer(1);
                    q.poll();
                }
            }
        });
        Thread t2 = new Thread(new Runnable() {
            @Override
            public void run() {
                while (!stop.get()) {
                    int size = q.size();
                    if (size != 0 && size != 1) {
                        fail.value++;
                    }
                }
            }
        });

        t1.start();
        t2.start();
        Thread.sleep(1000);
        stop.set(true);
        t1.join();
        t2.join();
        assertEquals("Unexpected size observed", 0, fail.value);
    }

If I leave size as is, all the bounded queues in JCTools pass. The unbounded ones have a size which is less accurate, demonstrating that a size observed concurrently is potentially garbage. This can be solved for linked queues by adding node indices, which comes at the cost of extra memory per node.
If I change the size implementation from this:

    public int size() {
        long after = lvConsumerIndex();
        while (true) {
            final long before = after;
            final long currentProducerIndex = lvProducerIndex();
            after = lvConsumerIndex();
            if (before == after) {
                return (int) (currentProducerIndex - after);
            }
        }
    }

To this:

    public int size() {
        long cIndex = lvConsumerIndex();
        long pIndex = lvProducerIndex();
        return (int) (pIndex - cIndex);
    }

The test consistently fails.