Conversant Disruptor


fredri...@scila.se

Feb 18, 2016, 4:23:08 AM
to mechanical-sympathy
Hello, we are evaluating the Conversant Disruptor as a drop-in replacement for other BlockingQueue implementations in multi-producer, multi-consumer scenarios where the number of queues, producers, and consumers is limited. In our tests the performance looks very promising, and there are also JMH tests available. Is anyone using this library? Is it production ready? (Despite the current version number, it doesn't seem to have been around for long.)

Martin Thompson

Feb 18, 2016, 11:48:28 AM
to mechanica...@googlegroups.com
Just reading the blurb on the website, I have to doubt the tests are measuring what they think they are. For example, they say they can transfer in a mean of 10ns, with some transfers taking 5ns. Given that the absolute minimum for a dirty cache hit between cores on the same socket is about 60 cycles on Intel CPUs, at 3 GHz a transfer would take at least 20ns.



--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Dr Heinz M. Kabutz

Feb 18, 2016, 12:39:20 PM
to mechanica...@googlegroups.com
Agreed.  Maybe HotSpot is eliminating their microbenchmark?
--
Dr Heinz M. Kabutz (PhD CompSci)
Author of "The Java(tm) Specialists' Newsletter"
Sun/Oracle Java Champion
JavaOne Rockstar Speaker
http://www.javaspecialists.eu
Tel: +30 69 75 595 262
Skype: kabutz

Martin Thompson

Feb 18, 2016, 12:46:14 PM
to mechanica...@googlegroups.com
Or the cache subsystem is running at 12 GHz and CPU instructions like fences don't cost any cycles at all. :-)

When I get time I'll have a look out of curiosity. Even more often than HotSpot eliminating code, I see people assume that latency is 1 / throughput. It's a very common mistake.
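A toy calculation (illustrative numbers only, not taken from any benchmark in this thread) shows why inverting throughput understates latency whenever transfers are pipelined:

```java
// Illustrative numbers only: why latency != 1 / throughput.
// If each transfer takes 100 ns end to end but 10 transfers are in
// flight at once, throughput is 0.1 ops/ns -- and naively inverting
// it suggests a "latency" of 10 ns, 10x better than reality.
class LatencyVsThroughput {
    // throughput in ops/ns for a pipeline with the given depth
    static double throughput(double latencyNs, int inFlight) {
        return inFlight / latencyNs;
    }

    // the misleading figure obtained by inverting throughput
    static double naiveLatency(double throughputOpsPerNs) {
        return 1.0 / throughputOpsPerNs;
    }

    public static void main(String[] args) {
        double realLatencyNs = 100.0;
        double tput = throughput(realLatencyNs, 10);
        System.out.printf("throughput: %.2f ops/ns%n", tput);
        System.out.printf("1/throughput: %.1f ns, real latency: %.1f ns%n",
                naiveLatency(tput), realLatencyNs);
    }
}
```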

Kirti Teja Rao

Feb 18, 2016, 10:00:41 PM
to mechanica...@googlegroups.com
The tests seem to measure mean time to offer or mean time to poll rather than latency. I have also seen misleading numbers (for example, 1.5x better than they should be) when threads are not pinned to different cores while measuring queue latency or throughput.


John Cairns

Mar 4, 2016, 12:08:08 PM
to mechanical-sympathy
Hi Fred, 

Thanks for your interest in Conversant Disruptor! Conversant has been using it in production since 2012 and the performance is excellent. The BlockingQueue implementation is very stable, although we continue to tune and improve it. The latest release, 1.2.4, is 100% production ready.

Although we have been working on it for a long time, we decided to open source our BlockingQueue this year to contribute something back to the community.

Feel free to reach out to me directly via email if you have any questions about Conversant Disruptor or need any resources. As you say, it's a drop-in for BlockingQueue, so it's a very easy test. Conversant Disruptor will crush ArrayBlockingQueue and LinkedTransferQueue for thread-to-thread transfers.

In our system, we noticed a 10-20% reduction in overall system load and latency when we introduced it.

John Cairns

John Cairns

Mar 4, 2016, 12:33:47 PM
to mechanical-sympathy
Martin,  

Thanks sincerely for your reply and your skepticism. Feedback from some of the top minds in the field was certainly something we hoped for when we open sourced our code. The OP mentioned that he is looking for a multi-consumer, multi-producer queue. In that case we measure roughly 20-40ns transfers, versus 50ns or more for LMAX. I have also found that our queue performs better than both the Java BlockingQueue implementations and LMAX in pretty much every measurement we have done, not just the JMH benchmark that we open sourced. The OP stated that he is finding good performance in _his_ measurements. I'm not surprised that he is.

I think your skepticism is understandable, but I don't think it warrants dismissing Conversant Disruptor out of hand either. I'd encourage everyone to download the code and try it out. We open sourced our code so that everyone can benefit and contribute. http://bit.ly/D15ruptor

A nice feature of our implementation is that people can use this Disruptor as a drop-in replacement for a Java BlockingQueue, so even if in some scenarios Conversant has the same performance as LMAX, users don't have to change their code to incorporate our queue.

In the announcement blog post, I specifically pointed out that I don't think Conversant Disruptor is an alternative to, or in competition with, LMAX. I think they are two different approaches to the same problem. Some people might like the event model of the LMAX Disruptor; others might like the convenience of a Java BlockingQueue. I say tomato.

Here is that link in case you missed it: http://bit.ly/ConvDisr

John Cairns

Martin Thompson

Mar 4, 2016, 12:47:16 PM
to mechanica...@googlegroups.com
Hi John,

I love finding new approaches that can reduce latency and increase throughput. I'm skeptical when people make claims that I know are not possible on given hardware. I see a transfer as the exchange of a data item from one thread/process to another in a correct fashion. You are now claiming 20-40ns in this thread, when your website claimed 5-10ns, which is absolutely not possible on Intel CPUs. I wish CPUs could exchange data between cores this fast, but sadly they cannot :-)

To have a fair comparison, you could add a benchmark to this set of benchmarks, so that we are comparing apples with apples. It would be a very simple thing for you to do. If you have a great implementation then that will be wonderful.


I'd love to see what results you get.

Regards,
Martin...




Jahnsson Niklas

Mar 6, 2016, 3:44:36 AM
to mechanical-sympathy
Hey,

Had a brief look at the code for the Conversant disruptor, and I don't think that the MultithreadConcurrentQueue is publishing changes to the buffer in a thread-safe manner. After all, it could be that once you have claimed a position, another thread has not yet read from it, so it is not free for writing. Or am I missing something here? (John can probably explain what exactly is happening.) I also believe the same issue exists when reading a value, as somebody could be writing to it at the same time.

Also, compared to the Disruptor, the Conversant disruptor doesn't allocate memory up front when instantiating the queue; instead the objects are allocated by the producers, and the queue only contains references to them. I think this is just a different programming model from what is actually used in the Disruptor.

Best,
Niklas

Martin Thompson

Mar 6, 2016, 4:16:09 AM
to mechanica...@googlegroups.com
I had a look at the algorithm and it appears not to work in the multi-producer scenario, at least.

Take the following case.

Producing thread one claims sequence 1 with a CAS on tailCursor, then takes an interrupt before it has set the value or the tail. Let's assume the interrupt happens on the following line.

Producing thread two claims sequence 2 with a CAS on tailCursor, then writes into its slot and updates tail to sequence 2.

Now along comes a consumer: it sees tail at sequence 2 and claims sequence 1. It then reads the slot and updates the head to sequence 1. It returns null, as producing thread one is still interrupted.

Then producing thread one gets scheduled to run: it writes into the slot and updates the tail to 1. This causes a lost update and sets the tail to the wrong value, as 2 is now also undone.

A simple stress test should show that the invariants for this queue do not hold. Unless I've not had enough coffee this morning, I do not see how this FIFO is a correct implementation. I suspect in the real world it will work most of the time and occasionally lose items.

Regards,
Martin...



Anthony Maire

Mar 7, 2016, 4:04:02 AM
to mechanical-sympathy
Hi Martin,

First of all, since this is my first message on this group, let me thank you and all the regular members for the valuable information you share here; it has helped me a lot to better understand low-level stuff.

I think the case you are describing is not possible (unless I've not had enough coffee too): the CAS on tailCursor compares against the current value of tail, so producer 2 should not be able to pass the CAS until tail has been updated by producer 1.
The algorithm seems thread-safe to me at first sight, but I guess it may be less efficient under very high contention than algorithms where the only synchronization point between producers is the CAS itself, with no other instructions in between.
If it is indeed correct, a benchmark comparing it to other implementations might be interesting, since guessing does not prove anything.

Regards,
Anthony

Jahnsson Niklas

Mar 7, 2016, 4:56:00 AM
to mechanical-sympathy
Hey Anthony,

You write

"the CAS on tailCursor compares with tail current value, so producer 2 should not be able to pass the CAS until tail has been updated by producer 1."

So once producer 1 has passed the CAS and is interrupted, producer 2 can pass the CAS.

The example Martin gave is valid I think.

-niklas

Simone Bordet

Mar 7, 2016, 5:12:07 AM
to mechanica...@googlegroups.com
Hi,
Not to beat a dead horse, but poll() is broken in the same way, and the same goes for other methods such as remove() and size().

The broken idiom is trying to update 2 atomic fields while thinking that a successful CAS on one field atomically guards the other, but that is obviously not true, as Martin showed.

--
Simone Bordet
http://bordet.blogspot.com
---
Finally, no matter how good the architecture and design are,
to deliver bug-free software with optimal performance and reliability,
the implementation technique must be flawless. Victoria Livschitz

Anthony Maire

Mar 7, 2016, 5:13:12 AM
to mechanical-sympathy
Let me make it clearer with an example: take a brand new instance, so tail = tailCursor = 0.

Producer 1 starts publishing and is interrupted just after the CAS, so tail is not updated: we have tailCursor = 1 and tail = 0.
Producer 2 reads tail (still 0) into tailSeq, computes tailNext = tailSeq + 1 (= 1), and tries tailCursor.CAS(tailSeq, tailNext), i.e. tailCursor.CAS(0, 1) => this fails, since tailCursor has already been set to 1 by producer 1's CAS.
Once producer 1 calls tail.lazySet() and the update becomes visible to producer 2, producer 2 will retry with tailCursor.CAS(1, 2) and will eventually pass the CAS.

Anthony
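For readers following along, the pattern Anthony describes can be sketched as below. This is a hedged reconstruction of the claim-then-publish idea under discussion, not the actual Conversant source; the poll side is shown single-consumer for brevity (the real class has a head cursor as well).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hedged sketch (not the Conversant code) of the claim/publish pattern:
// the producer CAS on tailCursor is guarded by the *published* tail, so
// a second producer cannot claim a slot until the first producer has
// both written its slot and advanced tail.
class ClaimPublishQueue<E> {
    private final Object[] buffer;
    private final int mask;
    private final AtomicLong tail = new AtomicLong();       // published
    private final AtomicLong tailCursor = new AtomicLong(); // claimed
    private final AtomicLong head = new AtomicLong();

    ClaimPublishQueue(int capacityPow2) { // capacity must be a power of two
        buffer = new Object[capacityPow2];
        mask = capacityPow2 - 1;
    }

    boolean offer(E e) {
        for (;;) {
            long seq = tail.get();                 // read the *published* tail
            if (seq - head.get() >= buffer.length) return false; // full
            if (tailCursor.compareAndSet(seq, seq + 1)) {
                buffer[(int) (seq & mask)] = e;    // write the slot...
                tail.lazySet(seq + 1);             // ...then publish it
                return true;
            }
            // CAS failed: another producer holds the claim; retry once it
            // publishes. Note this makes producers blocking on each other,
            // as pointed out later in the thread.
        }
    }

    @SuppressWarnings("unchecked")
    E poll() { // single-consumer version, for brevity
        long seq = head.get();
        if (seq >= tail.get()) return null;        // empty
        E e = (E) buffer[(int) (seq & mask)];
        head.lazySet(seq + 1);
        return e;
    }
}
```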

Simone Bordet

Mar 7, 2016, 5:36:59 AM
to mechanica...@googlegroups.com
Hi,

On Mon, Mar 7, 2016 at 11:13 AM, Anthony Maire <maire....@gmail.com> wrote:
> Let's take make it clearer with an example : let's take a brand new
> instance, so tail = tailCursor = 0
>
> Producer 1 starts publishing and is interrupted just after the CAS, so tail
> is not updated, we have tailCursor=1 and tail=0

Right, you use the value of tail and not that of tailCursor to update tailCursor.
I missed that, so I take back what I said in my earlier email.

I'm still not sure it is right; e.g. I still think that size() may return negative values and that a concurrent poll() may interfere with remove(E[]), so I need to look at this in more detail.

Anthony Maire

Mar 7, 2016, 5:43:42 AM
to mechanica...@googlegroups.com
I did not use anything, since I'm not the author of the algorithm; I missed this subtle point when I first read it too ;) I'm not claiming to be 100% sure the whole class is right, since I have only read the offer/poll methods, but the point that was raised earlier seems fine to me.

However, I had a second look at the code, and there is something that does seem broken: the headCache / tailCache fields.
If 2 producers are trying to publish, one of them can modify the value while another reads it, and the reader can see an inconsistent value (cf. JLS 17.7), since it is a non-volatile 64-bit value and so writes to it are not atomic.
Maybe this can lead to a producer passing the "queue full" test where it shouldn't.
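The concern here is JLS 17.7: reads and writes of 64-bit fields are only guaranteed atomic when the field is declared volatile. One conventional repair, sketched below as an assumption rather than the library's actual fix, is to hold the cached value in an AtomicLong updated with lazySet:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the tearing hazard and one conventional fix (assumed, not
// the library's actual code). Per JLS 17.7, a non-volatile long may be
// written as two 32-bit halves, so a concurrent reader can observe a
// torn value on 32-bit JVMs:
//
//     long headCache;              // may tear between threads
//
// An AtomicLong with lazySet keeps the write cheap (no full fence)
// while making the value untearable:
class HeadCache {
    private final AtomicLong headCache = new AtomicLong();

    long read() {
        return headCache.get();
    }

    void update(long v) {
        headCache.lazySet(v); // ordered store, never torn
    }
}
```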

Martin Thompson

Mar 7, 2016, 5:53:11 AM
to mechanica...@googlegroups.com
Hi Anthony,

You are right. I missed that it re-reads the tail if the CAS fails, which makes my observation incorrect. It does, however, make the operation blocking between producers. Disruptor 1.x-2.x had the same issue, and it caused very bad latency outliers; this was addressed in Disruptor 3.0. If the thread that succeeds with the CAS is then interrupted before it can set the tail, all other producers cannot progress until it does: they are blocked.

The best thing to do is to write a stress test to see if the invariants hold.

Regards,
Martin...



Nitsan Wakart

Mar 7, 2016, 7:18:05 AM
to mechanica...@googlegroups.com
Disclaimer, I'm the main author of JCTools.
Having read the code (offer/poll/size), some observations:
- tail/headCache being plain loads/stores will lead to value tearing on 32-bit platforms, making the values nonsense. This is easy to fix by using the padded atomic variant and lazySet, or you can add a javadoc/release note that the classes only work on 64-bit JVMs.
- all the padded classes are only half padded, leaving them open to false sharing with data to their left
- size() is broken: it can return negative values.
The approach using the ref + cursor is interesting, but I think the D. Vyukov algorithm (implemented in JCTools' MPMC queue) is better: it removes the tailRef and replaces it with a per-slot sequence, which also allows improved performance in the relaxed offer/poll cases.
JCTools doesn't offer blocking queues at the moment; I've been too busy to push them from experimental to core, but the code should be usable and may offer an interesting option to people. If there's great demand I can push the inclusion of the blocking queues up my priority list.
Comparisons with the Disruptor as a queue miss the fact that it offers a range of features (object reuse, event broadcasting, sequential or parallel pipeline stages) missing from queues, which make it the best choice for the use cases where those features are required. Using the Disruptor as a generic queue is an anti-pattern IMO, and as such the comparison makes for a bit of a straw-man argument.
If you are looking for some JMH queue benchmarks measuring latency/throughput there's a reasonably well used and well reviewed set of benchmarks in JCTools.
Have fun storming the castle :-)
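The per-slot-sequence idea Nitsan refers to can be sketched roughly as follows. This is a hedged simplification of D. Vyukov's bounded MPMC algorithm; JCTools' real MpmcArrayQueue differs (padding, relaxed accessors, Unsafe), so treat this as an illustration only.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Simplified sketch of the Vyukov bounded MPMC queue: each slot carries
// its own sequence number, so producers and consumers coordinate through
// the slot itself rather than through a shared published-tail field.
class VyukovMpmcQueue<E> {
    private final AtomicReferenceArray<E> buffer;
    private final AtomicLongArray sequences;
    private final int mask;
    private final AtomicLong tail = new AtomicLong();
    private final AtomicLong head = new AtomicLong();

    VyukovMpmcQueue(int capacityPow2) { // capacity must be a power of two
        buffer = new AtomicReferenceArray<>(capacityPow2);
        sequences = new AtomicLongArray(capacityPow2);
        mask = capacityPow2 - 1;
        for (int i = 0; i < capacityPow2; i++) sequences.set(i, i);
    }

    boolean offer(E e) {
        for (;;) {
            long t = tail.get();
            int slot = (int) (t & mask);
            long seq = sequences.get(slot);
            if (seq == t) {                        // slot free for this lap
                if (tail.compareAndSet(t, t + 1)) {
                    buffer.set(slot, e);
                    sequences.set(slot, t + 1);    // publish to consumers
                    return true;
                }
            } else if (seq < t) {
                return false;                      // queue full
            } // else another producer advanced tail; retry
        }
    }

    E poll() {
        for (;;) {
            long h = head.get();
            int slot = (int) (h & mask);
            long seq = sequences.get(slot);
            if (seq == h + 1) {                    // element published
                if (head.compareAndSet(h, h + 1)) {
                    E e = buffer.get(slot);
                    buffer.set(slot, null);
                    sequences.set(slot, h + mask + 1); // free for next lap
                    return e;
                }
            } else if (seq < h + 1) {
                return null;                       // queue empty
            } // else another consumer advanced head; retry
        }
    }
}
```

Because each slot carries its own sequence, a producer stalled after its CAS only delays the consumer of that one slot, rather than blocking all other producers behind a shared published-tail field.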

Fredrik Lydén

Mar 7, 2016, 8:28:53 AM
to mechanica...@googlegroups.com
Thanks for all input! At least our stress tests of poll, offer, put, take and drain pass (the size method is not included in the tests).

Nitsan, +1 for including the blocking queue support from jctools-experimental in core. We are already using the SPSC and MPSC queue implementations from JCTools and they work excellently.

And I understand that there are use cases where the LMAX disruptor is a better choice, but in this case we needed a drop-in BlockingQueue implementation, preferably one that can be created in a Spring configuration.

Thanks,
Fredrik





--
CTO
Scila AB
Sveavägen 25, 8TR
111 34 Stockholm
Sweden

Direct: +46 8 546 402 92
Mobile: +46 73 707 30 11
fredri...@scila.se
www.scila.se

John Cairns

Mar 7, 2016, 1:06:50 PM
to mechanical-sympathy, nit...@yahoo.com
Nitsan, 

Thanks for the suggestions. I will pad both sides of the padded values and add a release note about this code being specialized for 64-bit hardware and JVMs.

Can you explain how size() is broken? tail >= head, therefore tail - head >= 0.

Thanks,
John

Nitsan Wakart

Mar 7, 2016, 2:24:43 PM
to John Cairns, mechanical-sympathy
size = tail - head only works as expected when the queue is 'inert'. Imagine:
T1: tailVar = tail; and suspend
T2,T3: offer/poll as many times as you like
T1: headVar = head;
tail is no longer >= head.
See JCTools for one way of solving it.
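Nitsan's interleaving can be replayed deterministically with plain variables standing in for the counters (a toy illustration, not queue code):

```java
// Deterministic replay of the interleaving described above. The
// observer reads tail, is "preempted" while 10 offer/poll pairs run,
// then reads head -- and tail - head comes out negative even though
// the queue stayed effectively empty the whole time.
class NaiveSizeDemo {
    static long tail = 5, head = 5; // 5 elements offered and polled so far

    static long naiveSize() {
        long tailVar = tail;  // observer reads tail = 5 ...
        tail += 10;           // ... 10 more offers happen while it is
        head += 10;           // preempted, matched by 10 polls ...
        long headVar = head;  // ... observer resumes and reads head = 15
        return tailVar - headVar; // 5 - 15 = -10
    }

    public static void main(String[] args) {
        System.out.println("size = " + naiveSize());
    }
}
```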

Vitaly Davidovich

Mar 7, 2016, 3:29:32 PM
to mechanical-sympathy
What's the rationale behind the (unbounded) looping in https://github.com/JCTools/JCTools/blob/master/jctools-core/src/main/java/org/jctools/queues/MpmcArrayQueue.java#L221 to establish a stable snapshot? Why not just return Math.max(0, tail - head), where `tail' and `head' are captured into locals in whichever order yields the desired over/under estimation? The effort spent on getting a stable snapshot seems of questionable value, given that it (a) protects against preemption in a fairly narrow range of ops, (b) still yields just an estimate (through no fault of its own), and (c) can loop an indeterminate number of times (although it will terminate quickly in practice, in all likelihood). Given that size() in such collections can, at most, be used for monitoring/introspection, spending additional effort to improve its accuracy in the face of preemption doesn't seem worthwhile. What am I missing?
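For concreteness, the two strategies being compared look roughly like this (shapes assumed from the discussion, not JCTools' exact code):

```java
import java.util.concurrent.atomic.AtomicLong;

// Two ways to implement size() over a head/tail counter pair.
// stableSize() re-reads head until it is unchanged around the tail
// read, guaranteeing the pair coexisted at some instant; clampedSize()
// accepts a possibly inconsistent pair and bounds the result at zero.
class SizeStrategies {
    final AtomicLong head = new AtomicLong(); // consumer index
    final AtomicLong tail = new AtomicLong(); // producer index

    int stableSize() {
        long after = head.get();
        for (;;) {
            long before = after;
            long currentTail = tail.get();
            after = head.get();
            if (before == after) {       // head did not move: stable snapshot
                return (int) (currentTail - after);
            }
        }
    }

    int clampedSize() {
        long t = tail.get();
        long h = head.get();             // may be newer than t
        return (int) Math.max(0, t - h); // clamp away negative artifacts
    }
}
```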


Nitsan Wakart

Mar 8, 2016, 11:13:43 AM
to mechanica...@googlegroups.com
Nothing; it's an effort at reporting a size which the queue, at least at some point near the present, actually had.
Is the effort worthwhile? As you point out, it's likely to exit the loop quickly enough in any real scenario where the thread executing size() is not interrupted midway, so the effort from a computational POV is not much, especially given that size() is a monitoring method, likely to be called from a less critical thread at fairly long intervals.
As for the implementation/complexity cost, it's done, and not so complex I think.

John Cairns

Mar 9, 2016, 3:17:29 PM
to mechanical-sympathy
I added Conversant Disruptor to your benchmark and it looks pretty favorable. I don't think your "burst = 1" case measures anything other than test overhead, but here are the burst = 100 results for 1 producer. Multiple producers also fare much better than "Disruptor".

Disruptor:

 Percentiles, ns/op:
     p(0.0000) =   2112.000 ns/op
     p(50.0000) =   2312.000 ns/op
     p(90.0000) =   4076.000 ns/op
     p(95.0000) =   4152.000 ns/op
     p(99.0000) =   4368.000 ns/op
     p(99.9000) =  12432.000 ns/op
     p(99.9900) =  41698.899 ns/op
     p(99.9990) = 3185660.969 ns/op

Conversant Disruptor

  Percentiles, ns/op:
     p(0.0000) =   1674.000 ns/op
     p(50.0000) =   1848.000 ns/op
     p(90.0000) =   2304.000 ns/op
     p(95.0000) =   3204.000 ns/op
     p(99.0000) =   3272.000 ns/op
     p(99.9000) =  10672.000 ns/op
     p(99.9900) =  17868.301 ns/op
     p(99.9990) = 486270.789 ns/op

I read 20-30ns, exactly on par with our benchmarking in the multithreaded case.

Here is the summary table. Conversant Disruptor fares pretty well in every case.

Benchmark                                       (burstLength)    Mode     Cnt      Score      Error  Units
DisruptorBenchmark.test1Producer                            1  sample  345870    134.733 ±    0.884  ns/op
DisruptorBenchmark.test1Producer                          100  sample  281436   2759.219 ±  156.416  ns/op
DisruptorBenchmark.test2Producers                           1  sample  642211    437.281 ±  366.231  ns/op
DisruptorBenchmark.test2Producers                         100  sample  570393  11048.353 ± 1798.365  ns/op
DisruptorBenchmark.test3Producers                           1  sample  730151    985.686 ±  483.954  ns/op
DisruptorBenchmark.test3Producers                         100  sample  650905  23529.651 ± 2739.922  ns/op
DisruptorBlockingQueueBenchmark.test1Producer               1  sample  313751    146.263 ±    1.033  ns/op
DisruptorBlockingQueueBenchmark.test1Producer             100  sample  378077   2146.201 ±  148.813  ns/op
DisruptorBlockingQueueBenchmark.test2Producers              1  sample  605957    529.000 ±  339.598  ns/op
DisruptorBlockingQueueBenchmark.test2Producers            100  sample  613663   6672.167 ± 1225.265  ns/op
DisruptorBlockingQueueBenchmark.test3Producers              1  sample  808469    893.126 ±  411.736  ns/op
DisruptorBlockingQueueBenchmark.test3Producers            100  sample  815679  14184.251 ± 1825.509  ns/op


Martin Thompson

Mar 9, 2016, 6:32:56 PM
to mechanica...@googlegroups.com
John,

Can you send a pull request for the benchmark you have written?

Martin...


Vitaly Davidovich

Mar 9, 2016, 8:09:49 PM
to mechanica...@googlegroups.com
Right, it's not complex but has a slight head scratching element to it as it looks to be achieving something pointless.  It's head scratching in that it's not the obvious implementation, which ought to make anyone reading the code pause (unnecessarily, IMO) for just a second.

A size of 0 coming from Math.max(0, delta) is just as fair an answer as the loop, because clearly size=0 occurred "at some point near the present". I've seen this paradigm before, and have always wondered why people go through the (admittedly little) effort of presenting something of no more value than the dirt-simple and obvious version via a slightly more involved construct.


--
Sent from my phone

Vitaly Davidovich

Mar 9, 2016, 8:22:41 PM
to mechanica...@googlegroups.com
I haven't looked at the disruptor benchmark suite, so the disclaimer is I may say something dumb, in which case please correct me.

As someone mentioned upthread, the Disruptor is also the storage backing the items exchanged (with a memory indirection to boot): publishing an item involves loading that storage and performing translation (copying). The Conversant disruptor is a queue where the caller has already prepared the object for publication. Does the benchmark account for that? If not, it's a bit of an apples-to-tomatoes comparison; that's not to take anything away from Conversant on its own, but we should likely either not compare them or call out this difference, IMO.


--
Sent from my phone

Benedict Elliott Smith

Mar 9, 2016, 8:37:55 PM
to mechanica...@googlegroups.com
Not disputing the sense of your approach, Vitaly; I tend to see little point in producing a stable result for something so ephemeral. But I'm pretty sure the calculation you provide doesn't guarantee that zero actually did occur, just that it may have done, i.e. that some interleaving of a subset of the mutations whose executions overlapped with the call to size could have produced the value you see (not that any such subset actually did, for any value, including zero).

Vitaly Davidovich

Mar 9, 2016, 8:47:05 PM
to mechanica...@googlegroups.com
Perhaps this is a bit philosophical, but 0 did occur if you're seeing a negative value now: at some point you took a snapshot of head (or tail), then got preempted. While preempted, the other side advanced on the number line, crossing 0 at some point; you just didn't see that point due to scheduling. But my point really is that this semantic difference, if any, is devoid of any real meaning for things like a concurrent size(). As you agree, *any* value returned is transient. The key is that it's not a bogus value (it's within the spec's range) and is not out of thin air (it's a real observable value, even if scheduling didn't let you see it precisely).

Nitsan Wakart

Mar 10, 2016, 1:42:20 AM
to mechanica...@googlegroups.com
0 didn't have to happen for you to see 0 or negative values.
You can have producer/consumer progress in lockstep and 'see' 0 or a negative value despite the fact that the queue size has remained effectively the same throughout the time of your observation.

Nitsan Wakart

Mar 10, 2016, 2:13:59 AM
to mechanica...@googlegroups.com
John,
The benchmark measures the latency of a burst of messages plus the cost of signaling back. This is close to 'latency' when you are sending 1 message. The cost of discovery/initial signalling is quite high. This is not the benchmark lying to you, nor benchmark overhead, when you are looking at a single message. Sending the first message is the most expensive case, as there's the least opportunity for amortizing costs (e.g. a cache miss on the producer index, a cache miss on the array element's cache line, etc.).

"I read 20-30ns, exactly on par with our benchmarking in the multithread case"
The cost of sending 100 messages is NOT something you divide by 100 to find the latency of sending 1 message. This is the same as putting 100 people on a bus and hoping each one will arrive in 1% of the time it takes the bus to get wherever it's going. Cost and latency are not the same thing. Consider the plain costs here: the producer will write out to the LLC and the consumer will read from the LLC; that alone, with no other overheads or contention, will be more than 30ns. I'm not sure what the lowest possible latency between cores is, but there's no software solution to hardware limitations.

"Benchmark                                       (burstLength)    Mode     Cnt      Score      Error  Units
DisruptorBenchmark.test1Producer                            1  sample  345870    134.733 ±    0.884  ns/op
DisruptorBenchmark.test1Producer                          100  sample  281436   2759.219 ±  156.416  ns/op
DisruptorBenchmark.test2Producers                           1  sample  642211    437.281 ±  366.231  ns/op
DisruptorBenchmark.test2Producers                         100  sample  570393  11048.353 ± 1798.365  ns/op
DisruptorBenchmark.test3Producers                           1  sample  730151    985.686 ±  483.954  ns/op
DisruptorBenchmark.test3Producers                         100  sample  650905  23529.651 ± 2739.922  ns/op
DisruptorBlockingQueueBenchmark.test1Producer               1  sample  313751    146.263 ±    1.033  ns/op
DisruptorBlockingQueueBenchmark.test1Producer             100  sample  378077   2146.201 ±  148.813  ns/op
DisruptorBlockingQueueBenchmark.test2Producers              1  sample  605957    529.000 ±  339.598  ns/op
DisruptorBlockingQueueBenchmark.test2Producers            100  sample  613663   6672.167 ± 1225.265  ns/op
DisruptorBlockingQueueBenchmark.test3Producers              1  sample  808469    893.126 ±  411.736  ns/op
DisruptorBlockingQueueBenchmark.test3Producers            100  sample  815679  14184.251 ± 1825.509  ns/op"

Assuming you have maintained some reasonable benchmarking hygiene and the results are correct (using taskset, disabling turbo, a quiet machine, etc.), you are looking at comparing MPMC to MPSC. This may sound like MPSC has an advantage, but in this benchmark the opposite is true. The benchmark has a consumer spinning on a queue and the producer putting in a burst of messages; the queue is empty to start off, so overheads for the first discovered elements are high. If the consumer is faster than the producer (which is easily achievable for MPSC), the queue will remain close to empty throughout the benchmark, leading to higher overheads. I have seen the same effect when comparing MPMC and MPSC queues in similar benchmarks, and similar effects when adding more producer threads. If you choose to read from that that MPMC is "faster" than MPSC when both are used as SPSC, that is up to you.





Vitaly Davidovich

Mar 10, 2016, 7:34:07 AM
to mechanica...@googlegroups.com
You can observe 0 with stable snapshots, just not negative values; the observation in size() is completely uncoordinated with producers and consumers, and their moving in lockstep has no bearing on what value an uncoordinated observer sees. To capture the "effectively the same size throughout the time" effect, or generally to make statistically significant inferences, collect enough samples.

This comes down to Quality of Implementation. We'll all agree that hard-coding a return of 0, while a valid value, would be poor QoI. My claim is that completely ignoring preemption between two physically adjacent memory reads is identical QoI to the stable snapshot.

Benedict Elliott Smith

unread,
Mar 10, 2016, 8:03:36 AM3/10/16
to mechanica...@googlegroups.com
What's your definition of uncoordinated here, exactly?  If the index values only increase, confirming that one of these values was the same either side of a measurement of the other value guarantees that the queue really did pass through a state representing that size at some arbitrary point during the execution of the method.  That seems pretty coordinated to me.  Of course, the value can change and be completely stale by the time the method exits, but that's not the same as reporting a value that never occurred.

Said differently: there are definitely sequences of mutations that would yield a zero (or arbitrarily large, depending on how you do it) value for your implementation, where the real value is - at all times - (close to) the opposite.

In many cases these kinds of attempts to provide a stable value really are meaningless - for instance, ConcurrentHashMap used to (iirc) try to achieve a stable size value across all segments, for which it is far harder to conceive of a meaningful semantic difference (it is possible, but vanishingly less likely, to see such a blatant misreport).  Here, too, it's likely the distinction isn't meaningful for most cases, but there are conceivably situations in which the distinction does matter.

Vitaly Davidovich

unread,
Mar 10, 2016, 8:22:46 AM3/10/16
to mechanical-sympathy
What's your definition of uncoordinated here, exactly?

Observer takes a measurement at a time of its own choosing without producer/consumer being aware of this.

If the index values only increase, confirming that one of these values was the same either side of a measurement of the other value guarantees that the queue really did pass through a state representing that size at some arbitrary point during the execution of the method.  That seems pretty coordinated to me.

That's not coordinated, just stable observation.  By uncoordinated, I mean an observer can always see 0 even with stable snapshots if they always happen to take an observation when consumer has caught up to producer.  There're absolutely no guarantees here.  If the intent is to not observe 0s all the time due to being unlucky, then collect more samples.

Said differently: there are definitely sequences of mutations that would yield a zero (or arbitrarily large, depending on how you do it) value for your implementation, where the real value is - at all times - (close to) the opposite.

As I mentioned, to yield a statistically significant measurement, collect more samples.  There's no such thing as "real value, at all times" when you're using an uncoordinated estimate/observation.  Producers and consumers are running at full speed (let's assume); there's no single "real value at all times here", just a statistically significant value.

Here, too, it's likely the distinction isn't meaningful for most cases, but there are conceivably situations in which the distinction does matter.

I'm not suggesting that getting stable snapshots in some cases are important, I'm specifically talking about this example (and similar code for identical purpose that I've seen elsewhere).  We can be paranoid about preemption in between reading head/tail, but then we should also be paranoid about the unbounded stable read loop not terminating either.  But I haven't heard anyone express concern about such loops.

Benedict Elliott Smith

unread,
Mar 10, 2016, 8:31:32 AM3/10/16
to mechanica...@googlegroups.com
I agree there's no such thing as "this is the current value" - but there is clearly a distinction between values that represent states that really were occupied by the structure (however briefly), and values that not only never were, but are up to infinity from the real value.

but then we should also be paranoid about the unbounded stable read loop not terminating either

Why does one follow from the other? Although I completely agree that an optimal solution to this would make clear the tradeoffs in the API and, for instance, accept a parameter indicating the maximum amount of misreport that's accepted (as opposed to the currently implicit zero or infinity, of the two proposed alternatives), as well as a maximum number of loops to execute, returning some no result value if that is exceeded.  That isn't what is being debated here.  The question is if the semantics are the same, and they clearly are not; one is not better than the other, they simply offer different behaviours.  

Like most such things in computing, most users of the API won't even know or care there is this distinction in behaviour, on either side of the coin.

Vitaly Davidovich

unread,
Mar 10, 2016, 9:16:48 AM3/10/16
to mechanical-sympathy
Why does one follow from the other?

One doesn't follow from the other, it was a tongue-in-cheek remark.  My point was some people decide that they should implement size() to avoid preemption between two physically adjacent instructions - highly unlikely scenario, especially unlikely to repeat over and over.  The tongue-in-cheek is: if you're paranoid about that aspect, why not be paranoid that your loop never terminates? Clearly, the person writing this assumes it impossible (not just unlikely, but impossible) for that to happen.

The question is if the semantics are the same, and they clearly are not; one is not better than the other, they simply offer different behaviours.

The semantics are different when looking at them in isolation.  Is it significant for the case of size() implemented (via the 2 alternatives) as discussed in this thread? My claim is no.

Nitsan Wakart

unread,
Mar 10, 2016, 9:23:56 AM3/10/16
to mechanica...@googlegroups.com
"My point was some people decide that they should implement size() to avoid preemption between two physically adjacent instructions - highly unlikely scenario, especially unlikely to repeat over and over."
I'll repeat my original comment here. This is a balance between accuracy and effort. Returning 0 always is low effort and low accuracy.
Returning a value that is negative or exceeds capacity is incorrect.
Returning the value you suggest is better accuracy and low effort.
Computing size as per JCTools is slightly higher effort for handling yet another edge case. Is it worth it? I dunno, it's done... not sure the philosophical debate is worth the effort, but what do I know ;-)


Vitaly Davidovich

unread,
Mar 10, 2016, 9:29:15 AM3/10/16
to mechanica...@googlegroups.com
Agreed.  I didn't intend for this subject to drag on for this long either.  But you know, we like to discuss the finer details of things on this list ... :)


On Thursday, March 10, 2016, 'Nitsan Wakart' via mechanical-sympathy <mechanica...@googlegroups.com> wrote:
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Benedict Elliott Smith

unread,
Mar 10, 2016, 9:36:29 AM3/10/16
to mechanica...@googlegroups.com
Well, there is one distinction, which is that if size is called often enough then eventually an execution will be preempted there, and we have a problematic value.  Whereas the loop's very act of not terminating continues to provide opportunities for it to terminate.  So the likelihood of it never terminating is really pretty much zero, as opposed to just extraordinarily unlikely.  But the real question is which of these unlikely events do you want to avoid: outliers in value or execution time?  One system may care about one, and not the other.

But I agree that expecting preemption to occur more than a handful of times is getting into beyond improbable territory, and simply reporting the most accurate value of a handful of loops would be more sensible.  Probably the best tradeoff (ignoring the cost of discussing this :) is a loop that progressively increases the amount of inaccuracy permitted, so that prompt termination is guaranteed and accuracy is still extremely likely.
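The progressively relaxing loop Benedict suggests could look something like the sketch below. The field names and the tolerance schedule are made up for illustration; this is not JCTools or Conversant code.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a size() that retries for a stable snapshot but progressively
// relaxes the required accuracy, so prompt termination is guaranteed and an
// accurate value is still extremely likely. The index fields are stand-ins
// for a real queue's producer/consumer counters.
final class BoundedStableSize {
    final AtomicLong producerIndex = new AtomicLong();
    final AtomicLong consumerIndex = new AtomicLong();

    int size() {
        long after = consumerIndex.get();
        for (int tolerance = 0; ; tolerance++) {
            final long before = after;
            final long pIndex = producerIndex.get();
            after = consumerIndex.get();
            // accept the snapshot once the consumer index moved by no more
            // than the (growing) tolerance between the two reads
            if (after - before <= tolerance) {
                return (int) Math.max(0, pIndex - after);
            }
        }
    }
}
```

On the first pass this behaves exactly like the strict stable-snapshot loop (tolerance zero); each retry widens the accepted misreport by one element.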

Simon Thornington

unread,
Mar 10, 2016, 9:40:52 AM3/10/16
to mechanical-sympathy
It seems to me that though the two methods might be philosophically different, they are the same in that the usefulness of their value is the same.  The size is just a hint of the queue length at some point in the past.  Given that the "stronger" loopier size is no more useful, one does wonder "why bother?".  

John Cairns

unread,
Mar 10, 2016, 10:26:04 AM3/10/16
to mechanical-sympathy
If a fact is unknowable you can't know it even if you try to code around it very carefully.   In the spirit of quieting the doubters, I added the max(tail - head, 0) implementation.   But the truth is this solution is just as poor as any other attempt to "interpret the stable value."   In fact, it is deceptive because in returning seemingly well mannered values you falsely lead people to believe they can rely on the result of the size() calculation.

Most of the time tail - head will return a correct value.   In the one-in-a-million scenario posited by this preemption-gap situation, tail will essentially be a random variable.

What we know at the start:

tail <= head + queueSize

After some preemption delay you will have:

tail - head' <= queueSize

This means that the size() method can return any valid value including 0 and any negative value without any way to predict when it will or if it did.
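For concreteness, the max(tail - head, 0) implementation John mentions, extended with the capacity cap implied by Nitsan's definition of "broken", might be sketched as below. The names are illustrative, not Conversant's actual internals.

```java
// Clamp a raw (tail - head) read into [0, capacity]. Either index may be
// stale relative to the other, so the raw difference can be negative or
// exceed capacity; clamping makes the result well-mannered without making
// it any more trustworthy.
final class QueueSize {
    static int clampedSize(long tail, long head, int capacity) {
        long raw = tail - head;              // may be stale in either direction
        if (raw < 0) return 0;               // head read after tail went stale
        if (raw > capacity) return capacity; // tail read after head went stale
        return (int) raw;
    }
}
```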

When Nitsan called my implementation "broken" the funny part is that his implementation is equally broken but he didn't realize it.

There are only three technically correct solutions:

1.   Throw UnsupportedOperationException
2.   Synchronize versus state-change operations, at the cost of an order of magnitude in performance
3.   Increment an atomic size counter, at the cost of a factor of two in performance

I find all three of these to be unsatisfactory.   A Disruptor application should typically assume that size() is zero, it can use the "broken" size() implementation to validate but not to prove this hypothesis. 

Calling out one size() implementation over another is like saying you like the taste of peanut butter more than apple butter.    The emperor has no clothes. 

Benedict Elliott Smith

unread,
Mar 10, 2016, 10:38:52 AM3/10/16
to mechanica...@googlegroups.com
AFAICT 2 and 3 are semantically identical to looping for a stable value, and only differ in execution time for mutation/size.  So I would say that by your definition, Nitsan's approach is technically more correct than yours.

However, as far as I can tell all of the stated approaches are technically correct.  The only technically incorrect solution is one that is mistaken about its contract; there is no correct contract.



Vitaly Davidovich

unread,
Mar 10, 2016, 10:44:12 AM3/10/16
to mechanica...@googlegroups.com
Your size() wasn't broken, just unusual/unconventional.  But I don't understand - your queue implements your own interface, why even provide size() on it?

John Cairns

unread,
Mar 10, 2016, 12:46:05 PM3/10/16
to mechanical-sympathy, nit...@yahoo.com
Although I agree with you that there are significant advantages for the "Burst = 100" case, I don't really buy into the people-on-the-bus analogy. Is your "bus" a cache line? If so it only has 8 seats? More of a minivan, don't you think? Anyway, it is still limited by the processor interconnect like any other transfer. That interconnect puts a pretty strong constraint on the amount of advantage you get from a burst.

I do agree with you that the language around latency is not very good here. We should probably be talking mega-transfers/second, as is often done with CPU interconnects. In my measurement I'm seeing around 36MT/s. That is higher than any other Java BlockingQueue.

John

Martin Thompson

unread,
Mar 10, 2016, 2:08:37 PM3/10/16
to mechanical-sympathy
You posted benchmark results without publishing the benchmark so I wrote a benchmark to get some independently verifiable results.  My benchmark and full results are attached.

I ran my tests on a Linux 4.2 kernel and an Ivy Bridge Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz.

The figures you post below look like what I typically see on an Intel Xeon with Linux not configured for low latency. You will find my figures quite different.

Couple of observations I made when creating the benchmark:
- Other than ideas taken from the Disruptor codebase, the name Disruptor is confusing as it does not offer any of the Disruptor's features or API.
- The code does not implement the Queue interface so it is not a drop-in replacement as suggested.

From the performance tests I do not see a performance advantage over the LMAX Disruptor. The Conversant Disruptor is higher latency, and its latency is very unpredictable.

The burst length of one test measures a single round-trip transfer. If I compare to the baseline then the Conversant Disruptor gives an uncontended latency of ~150ns for a one-way transfer. This is NOT just test overhead. Profile to see where the time goes. Bursts of 100 with multiple producers measure contention.

When I look at the contended case things get more interesting. The error bar is larger than for other implementations I have measured. On reviewing the implementation, this is partially down to the blocking action I mentioned between the producers, but even more so due to the yielding and parking. The parking does not come back for over 50 microseconds: you think you are getting 50ns (hardcoded in your code) but in reality you are getting at least 50us + 50ns. This makes latency unpredictable and can take threads out of the race, so throughput can seem artificially inflated.

In a low-latency financial trading environment I'm clear about which implementation I would take.
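The parking point is easy to demonstrate with a rough, unscientific sketch (this is not the attached JMH benchmark) that times how long LockSupport.parkNanos(50) actually takes; concrete numbers vary by kernel, configuration and load.

```java
import java.util.concurrent.locks.LockSupport;

// Requesting a 50ns park does not give you a 50ns park: with no permit
// available the thread goes to the kernel, and timer granularity typically
// adds tens of microseconds before it comes back.
final class ParkCost {
    static long meanParkNanos(int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            LockSupport.parkNanos(50L);          // ask for 50ns
            total += System.nanoTime() - t0;     // observe what we got
        }
        return total / runs;
    }

    public static void main(String[] args) {
        meanParkNanos(1_000); // warm up
        System.out.println("requested 50ns, observed mean "
                + meanParkNanos(1_000) + "ns");
    }
}
```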

# JMH 1.11.3 (released 56 days ago)
# VM version: JDK 1.8.0_74, VM 25.74-b02
# VM invoker: /home/martin/opt/jdk1.8.0_74/jre/bin/java
# VM options: -Dagrona.disable.bounds.checks=true
# Warmup: 5 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 3 threads, will synchronize iterations
# Benchmark mode: Sampling time
# Benchmark: uk.co.real_logic.benchmarks.latency.ConversantDisruptorBenchmark.test3Producers
# Parameters: (burstLength = 100)

Histogram, ns/op:
[ 0.000, 1250000.000) = 2818800
[ 1250000.000, 2500000.000) = 9794
[ 2500000.000, 3750000.000) = 2367
[ 3750000.000, 5000000.000) = 611
[ 5000000.000, 6250000.000) = 219
[ 6250000.000, 7500000.000) = 50
[ 7500000.000, 8750000.000) = 8
[ 8750000.000, 10000000.000) = 3
[10000000.000, 11250000.000) = 1
[11250000.000, 12500000.000) = 0
[12500000.000, 13750000.000) = 1
[13750000.000, 15000000.000) = 0
[15000000.000, 16250000.000) = 0
[16250000.000, 17500000.000) = 0
[17500000.000, 18750000.000) = 0

Percentiles, ns/op:
p(0.0000) = 2796.000 ns/op
p(50.0000) = 7304.000 ns/op
p(90.0000) = 9408.000 ns/op
p(95.0000) = 13280.000 ns/op
p(99.0000) = 563200.000 ns/op
p(99.9000) = 2646016.000 ns/op
p(99.9900) = 4995600.384 ns/op
p(99.9990) = 6938987.315 ns/op
p(99.9999) = 9553906.564 ns/op
p(100.0000) = 13680640.000 ns/op


# JMH 1.11.3 (released 56 days ago)
# VM version: JDK 1.8.0_74, VM 25.74-b02
# VM invoker: /home/martin/opt/jdk1.8.0_74/jre/bin/java
# VM options: -Dagrona.disable.bounds.checks=true
# Warmup: 5 iterations, 1 s each
# Measurement: 10 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 3 threads, will synchronize iterations
# Benchmark mode: Sampling time
# Benchmark: uk.co.real_logic.benchmarks.latency.DisruptorBenchmark.test3Producers
# Parameters: (burstLength = 100)

Histogram, ns/op:
[ 0.000, 2500.000) = 4084352
[ 2500.000, 5000.000) = 207
[ 5000.000, 7500.000) = 17
[ 7500.000, 10000.000) = 17
[10000.000, 12500.000) = 4
[12500.000, 15000.000) = 2
[15000.000, 17500.000) = 2
[17500.000, 20000.000) = 1
[20000.000, 22500.000) = 0
[22500.000, 25000.000) = 0
[25000.000, 27500.000) = 1

Percentiles, ns/op:
p(0.0000) = 139.000 ns/op
p(50.0000) = 286.000 ns/op
p(90.0000) = 418.000 ns/op
p(95.0000) = 500.000 ns/op
p(99.0000) = 670.000 ns/op
p(99.9000) = 875.000 ns/op
p(99.9900) = 1232.000 ns/op
p(99.9990) = 5187.695 ns/op
p(99.9999) = 15058.109 ns/op
p(100.0000) = 26016.000 ns/op

Benchmark                                    (burstLength)    Mode      Cnt      Score     Error  Units
ConversantDisruptorBenchmark.test1Producer               1  sample  1757757    243.276 ±   0.155  ns/op
ConversantDisruptorBenchmark.test1Producer             100  sample  1329366   6601.872 ±   1.642  ns/op
ConversantDisruptorBenchmark.test2Producers              1  sample  3081830    277.336 ±   0.110  ns/op
ConversantDisruptorBenchmark.test2Producers            100  sample  2765956  16047.311 ± 242.942  ns/op
ConversantDisruptorBenchmark.test3Producers              1  sample  3342357    366.687 ±   0.217  ns/op
ConversantDisruptorBenchmark.test3Producers            100  sample  2831854  25040.348 ± 340.575  ns/op
DisruptorBenchmark.test1Producer                         1  sample  1766823    230.556 ±   0.176  ns/op
DisruptorBenchmark.test1Producer                       100  sample  1600208   7240.488 ±   1.655  ns/op
DisruptorBenchmark.test2Producers                        1  sample  2954373    257.757 ±   0.105  ns/op
DisruptorBenchmark.test2Producers                      100  sample  2852658  15778.326 ±   3.778  ns/op
DisruptorBenchmark.test3Producers                        1  sample  4084603    310.108 ±   0.165  ns/op
DisruptorBenchmark.test3Producers                      100  sample  3043268  29028.470 ±  16.597  ns/op
human.txt
ConversantDisruptorBenchmark.patch

Benedict Elliott Smith

unread,
Mar 10, 2016, 2:39:48 PM3/10/16
to mechanica...@googlegroups.com
It's a bit of an unfair comparison to pit a non-parking disruptor with fewer threads than physical cores against a queue that parks.  The disruptor also fares badly with parking.

This seems very reminiscent of the conversation about size: the question is what are your goals, constraints and tradeoffs? No doubt, most people don't understand these things well enough to make an informed decision, so it can help to dispel misconceptions.  But it doesn't seem to me this analysis does that.

To presuppose everyone is in high frequency trading and has fewer threads than physical cores (and hence should be using a disruptor) is patently false.  In fact I've rarely seen anyone deploy the disruptor to such a setup - every instance I've encountered has used a blocking strategy for a queue that will regularly be exhausted.  The result is just making people feel good about having deployed the disruptor, not having derived any actual benefit therefrom.

Since the "conversant disruptor" has focused its measurements on peak transfer rates under saturation, I doubt it has properly exercised its behaviour in realistic blocking cases either.  But just focusing on this imbalanced comparison does not seem to advance collective understanding.

(I don't blame you for being annoyed about the naming though)


Martin Thompson

unread,
Mar 10, 2016, 3:05:27 PM3/10/16
to mechanica...@googlegroups.com
You are absolutely right on the tradeoffs and only a small proportion of people are in high frequency trading.

My main goal was to test the latency claims. I have a clear answer for contended and uncontended transfers between threads.

John Cairns

unread,
Mar 10, 2016, 6:57:11 PM3/10/16
to mechanical-sympathy
Hi,

You have a pull request on github as you requested.   Perhaps it was lost in your inbox?   

Out of curiosity what do you get when you set SpinPolicy.SPINNING?

John

Fredrik Lydén

unread,
Mar 11, 2016, 4:04:07 AM3/11/16
to mechanica...@googlegroups.com
Regarding drop-in replacement, only the BlockingQueue implementations (PushPullBlockingQueue and DisruptorBlockingQueue) also implement Queue, and in our specific use case this was extremely handy.

+1 that the name is a bit confusing though.




--
CTO
Scila AB
Sveavägen 25, 8TR
111 34 Stockholm
Sweden

Direct: +46 8 546 402 92
Mobile: +46 73 707 30 11
fredri...@scila.se
www.scila.se

Martin Thompson

unread,
Mar 11, 2016, 5:01:47 AM3/11/16
to mechanica...@googlegroups.com
I see you posted a new version of your library and a PR for the tests, but only after I had to write my own benchmark to verify the results you posted. This was too late. It is good practice to publish benchmarks at the same time you make claims about results so others can independently verify them. We have a name for that. It is called science :-)

At this stage I am moving on. For an MPSC queue I think the work by Dmitry Vyukov of www.1024cores.net fame provides better throughput and latency. Java ports of his C++ work can be found in Agrona and JCTools. These can be wrapped with a progressive backoff strategy if someone wants pseudo-blocking semantics. I don't see anything novel with the "Conversant Disruptor". I see a number of internal techniques copied from the early versions of the Disruptor applied to queue internals. Many developers mistakenly use the original Disruptor thinking it is a queue replacement. The Disruptor is useful when you require event pooling, a graph of dependent event processors, and batch API semantics. These features, which you call "over engineering" on your blog, are used to great effect by many who use the Disruptor appropriately, i.e. not just as a queue replacement.

The Disruptor has wait strategies that are swappable for those who need to choose between condition-variable-based blocking, progressive backoff, and spinning. Choices need to be made taking into account the requirements for CPU resource and the desired latency profile. Making the tradeoffs is a non-trivial subject and this thread is a great illustration to me of how far we still have to go as an industry. If anyone is interested in a good discussion on waiting and signalling techniques in the context of the various trade-offs then I'd enjoy being part of that.
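As a rough sketch of the kind of progressive backoff described here (a generic illustration, not the Disruptor's actual WaitStrategy API): spin briefly for the lowest latency, then yield the time slice, then park to stop burning CPU.

```java
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Generic progressive-backoff wait loop: the spin/yield thresholds and the
// park duration are arbitrary illustrative values; a real strategy would
// tune them for the target CPU budget and latency profile.
final class BackoffIdle {
    static void awaitTrue(BooleanSupplier condition) {
        int spins = 0;
        while (!condition.getAsBoolean()) {
            if (spins < 100) {
                Thread.onSpinWait();            // busy spin: lowest latency
            } else if (spins < 200) {
                Thread.yield();                 // give up the time slice
            } else {
                LockSupport.parkNanos(1_000L);  // back off: lowest CPU burn
            }
            spins++;
        }
    }
}
```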

Martin...




Nitsan Wakart

unread,
Mar 11, 2016, 5:21:59 AM3/11/16
to mechanica...@googlegroups.com
"When Nitsan called my implementation "broken" the funny part is that his implementation is equally broken but he didn't realize it."
IMO: size() method which returns a value that is < 0 or > capacity is broken.
Your implementation will do so under certain conditions; it was therefore, by that definition, broken. You don't have to agree with the definition.
As for degrees of accuracy of the implementation, I think we went over it. Take your pick and be merry.
I'm not sure what I missed in terms of broken implementation, but I am delighted to have entertained you and anyone else who got the joke.
"The emperor has no clothes."
I must concede that I do most of my coding naked and therefore if I am the emperor you are on the money.

Martin Thompson

unread,
Mar 11, 2016, 5:30:28 AM3/11/16
to mechanical-sympathy
Correction.

MPSC -> MPMC.

Nitsan Wakart

unread,
Mar 11, 2016, 5:35:42 AM3/11/16
to mechanica...@googlegroups.com
"I don't really buy into the people on the bus analogy." The analogy is there to highlight the difference between latency and throughput. Since you later say: "I do agree with you that the language around latency is not very good here." I thought you got the point, but then I read onward to "We should probably talking mega-transfers/second as is often done with CPU interconnects" and lost hope.
You repeatedly took the "time to send 100 messages and see a response" measure as something you can divide to get a single-message latency. That is a plain mistake. Or 'not very good language' if you feel like downplaying the mistake. It's not correct, is what I'm trying to get across. We should not be talking about mega-transfers; we already have a perfectly workable measurement on the application level, but you can't derive a single-message latency out of it.

ki...@kodewerk.com

unread,
Mar 11, 2016, 8:42:57 AM3/11/16
to mechanica...@googlegroups.com
Guys,

This is a tech discussion, and if any implementation is “broken” it should be established in fact (a test is an excellent way to establish a fact) and the author should take this as a signal that they need to fix something. I’m interested in hearing about what a fix might look like, not what Nitsan coding in the nude might look like.

Regards,
Kirk


Nitsan Wakart

unread,
Mar 11, 2016, 10:08:59 AM3/11/16
to mechanica...@googlegroups.com
Absolutely Kirk, thank you for correcting the course of discussion. Here's a test:

    @Test
    public void testSize() throws Exception {
        assumeThat(spec.isBounded(), is(true));
        final AtomicBoolean stop = new AtomicBoolean();
        final Queue<Integer> q = queue;
        final Val fail = new Val();
        Thread t1 = new Thread(new Runnable() {
            @Override
            public void run() {
                while (!stop.get()) {
                    q.offer(1);
                    q.poll();
                }
            }
        });
        Thread t2 = new Thread(new Runnable() {
            @Override
            public void run() {
                while (!stop.get()) {
                    int size = q.size();
                    if (size != 0 && size != 1) {
                        fail.value++;
                    }
                }
            }
        });

        t1.start();
        t2.start();
        Thread.sleep(1000);
        stop.set(true);
        t1.join();
        t2.join();
        assertEquals("Unexpected size observed", 0, fail.value);
    }

If I leave size as is, all the bounded queues in JCTools pass. The unbounded ones have a size which is less accurate, demonstrating that a size observed concurrently is potentially garbage. This can be solved for linked queues by adding node indices, which comes at the cost of extra memory per node.
If I change the size implementation from this:

    public int size() {
        long after = lvConsumerIndex();
        while (true) {
            final long before = after;
            final long currentProducerIndex = lvProducerIndex();
            after = lvConsumerIndex();
            if (before == after) {
                return (int) (currentProducerIndex - after);
            }
        }
    }

To this:

    public int size() {
        long cIndex = lvConsumerIndex();
        long pIndex = lvProducerIndex();
        return (int) (pIndex - cIndex);
    }

The test consistently fails.