Eat What You Kill


Greg Wilkins

Apr 24, 2015, 3:10:03 AM
to mechanica...@googlegroups.com


Many months ago, I asked in this forum how to avoid the parallel slowdown in handling requests from a multiplexed HTTP/2 connection.  The problem is that, to avoid HOL blocking, requests are dispatched to other threads, but that means the CPU core that handles a request is most likely a different one from the core that parsed it, so its cache will be cold with respect to all the request data.

The suggestion here was that I look at some kind of work-stealing algorithm to avoid HOL blocking while keeping request streams mostly on the same core, by using a single queue per thread. Good idea, but it was too complex to implement in our environment (Jetty). We also looked at the Disruptor, and it was not a good fit either.

So we have come up with our own scheduling strategy for Jetty 9.3's HTTP/2, which we have nicknamed Eat What You Kill; it implements the producer-consumer pattern with mechanical sympathy.
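
In rough pseudo-Java, the core of the idea is something like the sketch below (simplified for illustration only - these are not the actual Jetty classes, and it omits the idle/re-arm and thread-limit logic):

    import java.util.concurrent.Executor;

    public class EatWhatYouKill implements Runnable {
        // The producer is e.g. the HTTP/2 parser: it returns the next request task, or null if there is none.
        interface Producer { Runnable produce(); }

        private final Producer producer;
        private final Executor executor;

        public EatWhatYouKill(Producer producer, Executor executor) {
            this.producer = producer;
            this.executor = executor;
        }

        @Override
        public void run() {
            Runnable task = producer.produce(); // parse: this core's cache is now hot with the request
            if (task == null)
                return;
            executor.execute(this);             // hand the producing role to another thread...
            task.run();                         // ...and consume what we just produced, on the hot core
        }
    }

The point being that what is handed to another thread is the cheap producing role, not the freshly parsed request.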

I've written up a blog post describing the problem and our solution, which you can preview here: https://webtide.com/?p=2870&preview=1&_ppp=f8c4ae3461  I'd very much appreciate some review/feedback from this forum before I publish it - especially if I have: accidentally plagiarised an existing idea; missed something which means I'm fooling myself; badly described the whole thing; etc.

cheers

Sergey Zubarev

Apr 24, 2015, 6:04:12 AM
to mechanica...@googlegroups.com
Hi, interesting post!
Have you any latency numbers as well for strategies you tested?

Greg Wilkins

Apr 24, 2015, 8:14:44 AM
to mechanica...@googlegroups.com


On Friday, 24 April 2015 20:04:12 UTC+10, Sergey Zubarev wrote:
Hi, interesting post!
Have you any latency numbers as well for strategies you tested?

Not really. I'm using JMH for the first time, and while I can ask it to collect latency numbers, I'm not sure it is collecting exactly what I think it is.  So I need to understand more about JMH before I report those numbers and comment on what they say.

It does report low latency for PC and EWYK, as I would have expected, but EWYK is only a little lower than PC, and I would have expected to see the HOL blocking delays in the PC measurement.   PEC has a long latency, which is also expected, but it is actually much longer than I would have expected.

I think the problem is that JMH is measuring the latency for the entire connection, when the interesting number is the latency (and latency distribution) of each individual request.  So I need to instrument each request in a way that does not introduce contention and data/cache effects between the connections.
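
What I have in mind is something like one histogram per simulated connection, merged only after the run - e.g. a sketch along these lines using HdrHistogram (not something I have written yet):

    import java.util.concurrent.TimeUnit;
    import org.HdrHistogram.Histogram;

    // One recorder per simulated connection, so connections never contend on a shared histogram.
    public class ConnectionLatency {
        private final Histogram histogram = new Histogram(TimeUnit.SECONDS.toNanos(10), 3);

        public long begin() { return System.nanoTime(); }

        public void end(long beginNanos) { histogram.recordValue(System.nanoTime() - beginNanos); }

        // Merge the per-connection histograms after the run, off the hot path.
        public static Histogram merge(Iterable<ConnectionLatency> connections) {
            Histogram total = new Histogram(TimeUnit.SECONDS.toNanos(10), 3);
            for (ConnectionLatency c : connections)
                total.add(c.histogram);
            return total;
        }
    }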

ymo

Apr 24, 2015, 10:10:14 AM
to mechanica...@googlegroups.com
Have you looked at how they did it in Aeron? If I were writing an NIO selector from scratch I would follow the same guidelines. It's based on a lock-free buffer with a single producer and multiple consumers, as you have above.

Vitaly Davidovich

Apr 24, 2015, 10:21:20 AM
to mechanica...@googlegroups.com
How does Aeron handle slow consumers? If the buffer is overrun (say all/most consumers are slow), what's the backpressure model? Also, how much work does Aeron do in the producer? Is it parsing a message and then handing it off, or does it hand off just a pointer to a consumer, which then does the bulk of the work? There's something to be said for keeping processing of a request/message all on the same core, if possible, and minimizing data movement across caches.


ymo

Apr 24, 2015, 10:35:41 AM
to mechanica...@googlegroups.com
AFAIK it does not do the payload parsing, since it is only a transport protocol. The guys who wrote this thing are here, so they can give more details about it.

I am not sure about the back-pressure model, since this is always application specific. But when you are running at full capacity, the easiest thing I found was to just drop new connections but try to complete the ones in flight )))

Martin Thompson

Apr 24, 2015, 10:37:33 AM
to mechanica...@googlegroups.com
Aeron has a single thread receiving from the network, and a single thread sending. The receiving thread copies the data into data structures that are append-only and can be read by multiple threads without any locks or full fences. The single sender or receiver thread can easily saturate a 10GigE connection.

Each consumer has a progress position counter that is used for flow control, aka back pressure. The algorithms are pluggable. The default behaviour is to gate on the min position, i.e. the slowest consumer. We are looking at adding strategies like discarding cry babies that cannot keep up. They can then catch up, if they like, by requesting older data from an archive service. We track absolute positions and can identify any byte across time on a stream.
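
In essence, the default strategy amounts to something like the following (an illustrative sketch, not the actual Aeron code):

    import java.util.concurrent.atomic.AtomicLong;

    // Gate the sender on the slowest consumer: it may only run ahead of the
    // minimum consumer position by the configured window.
    public class MinPositionFlowControl {
        private final AtomicLong[] consumerPositions; // one progress counter per consumer
        private final long windowLength;              // how far the sender may run ahead

        public MinPositionFlowControl(AtomicLong[] consumerPositions, long windowLength) {
            this.consumerPositions = consumerPositions;
            this.windowLength = windowLength;
        }

        public long senderLimit() {
            long min = Long.MAX_VALUE;
            for (AtomicLong position : consumerPositions)
                min = Math.min(min, position.get());
            return min + windowLength;
        }
    }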

On the publication side, each producer claims space in the outgoing buffer, copies in its data, and sets up the network framing header, so the sender only copies the data to the socket.

We deal with head-of-line blocking, loss detection, and retransmission without impacting the normal flow for the in-flight window.

Our benchmarking shows this to have 3-5X greater throughput for small messages than the best existing commercial products. We can also beat the best on mode latency, and we are in a different league when it comes to the distribution in the tail.

ymo

Apr 24, 2015, 10:42:39 AM
to mechanica...@googlegroups.com
I kind of don't understand the JMH benchmark altogether, especially the part where you have .threads(2000)?

For the sake of sanity, here is what I do:
1) run one producer on one core
2) run all consumers on other cores
3) make sure I am always on the same socket
4) make sure the kernel is not scheduling anything on those cores

It is a benchmark after all, but at least it removes all the other variables from the equation and only tests how new consumers are processing your requests. Makes sense?


Vitaly Davidovich

Apr 24, 2015, 10:46:40 AM
to mechanica...@googlegroups.com
Thanks Martin.  The append-only log/data structure -- is it backed by disk (mmap)? I'm assuming it's bounded to some size, so what happens if you get a bunch of slow consumers? I understand they won't prevent other consumers from going through the log and consuming subsequent messages, but if you're not dropping them altogether, they must be handcuffing you to some degree by not allowing release/reuse of old logs? What is the memory consumption going to look like here?

Also, what is the memory footprint of Aeron, say, in default configuration? I'm talking about purely its own overhead, not anything arbitrary consumers would add.

Vitaly Davidovich

Apr 24, 2015, 10:59:09 AM
to mechanica...@googlegroups.com

Also, which commercial (or other) products did you compare against? Just curious.

Also, separately, Cloudius Systems have a nice-looking open source framework (it's not a protocol) with sound performance principles: http://www.seastar-project.org.  Someone linked to it on this list a few weeks ago.

sent from my phone

Martin Thompson

Apr 24, 2015, 10:59:25 AM
to mechanica...@googlegroups.com
Thanks Martin.  The append-only log/datastructure -- is it backed by disk (mmap)? I'm assuming it's bounded to some size, so what happens if you get a bunch of slow consumers? I understand they won't prevent other consumers from going through the log and consuming subsequent messages, but if you're not dropping them now altogether, they must be handcuffing you to some degree by not allowing release/reuse of old logs? How's the memory consumption going to look like here?

The log files are in SHM. We default it to /dev/shm on Linux to avoid page faults. Per stream, Aeron rotates three log partitions: Active, Dirty, and Clean. Default memory is 16MB per partition and can be increased to whatever you want. They are sized to handle in-flight windows for loss and what you want to tolerate from slow subscribers. With the default flow control strategy you are held back by the slowest subscriber. If one of them stops, then the current stream is back pressured. This is a good "teaching" strategy :-) Over time we plan to add a suite of flow control and congestion control strategies, and we have an open API so people can plug in their own.

I talk about how it works in this presentation:


Also, what is the memory footprint of Aeron, say, in default configuration? I'm talking about purely its own overhead, not anything arbitrary consumers would add.

Footprint is tiny for Aeron if you discount the log buffers. The size then is a function of how many streams you wish to run and the size of the log buffer partitions. For another project I'm adding persistence of streams to disk so a slow subscriber, or a disconnected subscriber, can query a stream to join it from a position.

Martin Thompson

Apr 24, 2015, 11:04:01 AM
to mechanica...@googlegroups.com
On 24 April 2015 at 15:59, Vitaly Davidovich <vit...@gmail.com> wrote:

Also which commercial or otherwise products did you compare against? Just curious.

I'd rather not list them in public. Use your imagination for major providers of low-latency messaging in finance...

As with all good benchmarking, don't take my word for it; measure Aeron against whatever you have and your application needs, then judge for yourself :-P

Also, separately, cloudius systems have a nice looking open source framework  (it's not a protocol) with sound performance principles: http://www.seastar-project.org.  Someone linked to it on this list a few weeks ago

Thanks, I'll have a look. 

ymo

Apr 24, 2015, 11:16:28 AM
to mechanica...@googlegroups.com
Martin, if you had to support TCP, how would you handle TCP fragmentation? Meaning, how would you make sure the code using Aeron is notified only when the full message (or some significant part of it) is ready for parsing?

P.S.
I am hoping this is not hijacking the thread!

Regards.

Vitaly Davidovich

Apr 24, 2015, 11:20:11 AM
to mechanica...@googlegroups.com
The few commercial products I'm aware of typically offer some sort of failover/redundancy/persistence/durability/etc. options, which usually come with a performance hit.  There's nothing wrong with running from /dev/shm, but it's a different use case.


Francesco Nigro

Apr 24, 2015, 11:48:11 AM
to mechanica...@googlegroups.com
Hi Martin,
I've just read the slides and I don't understand this:
"Persistent data structures can be safe to read without locks"

I know that it's off-topic, but I'm really curious about the explanation... I thought that the memory-mapped files should be treated as every other piece of memory, with the proper barriers for them to be written/read by producer/consumer...

Martin Thompson

Apr 24, 2015, 12:02:13 PM
to mechanica...@googlegroups.com
This is what I hate about slides taken on their own, without the presentation they support. I think SlideShare is so wrong :-) A presentation is a presentation and a document is a document.

"Persistent" in this sense is the functional programming definition and not the storage to disk.

An FP persistent data structure does not mutate under the reader. For the period of time it is valid, it is persistent from the FP perspective.
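
A trivial illustration of why that allows lock-free reading (a sketch only, not the actual Aeron log structure): a single-writer, append-only buffer where everything below the published limit never changes, so readers need no locks on it.

    import java.util.concurrent.atomic.AtomicLong;

    // Single writer appends; everything below 'limit' never mutates, so any number of
    // readers can read it with plain loads and no locks. Capacity/wrap handling omitted.
    public class AppendOnlyLog {
        private final long[] entries = new long[1 << 20];
        private final AtomicLong limit = new AtomicLong(0);

        public void append(long value) {            // single writer only
            long index = limit.get();
            entries[(int) index] = value;
            limit.lazySet(index + 1);               // ordered store: the data is visible before the new limit
        }

        public long read(long index) {              // any reader, no locks
            if (index >= limit.get())
                throw new IllegalArgumentException("not yet published: " + index);
            return entries[(int) index];
        }
    }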

Hope that helps?

Martin Thompson

Apr 24, 2015, 12:07:20 PM
to mechanica...@googlegroups.com
On 24 April 2015 at 16:20, Vitaly Davidovich <vit...@gmail.com> wrote:
The few commercial products I'm aware of typically offer some sort of failover/redundancy/persistence/durability/etc options, which usually come at a performance hit.  There's nothing wrong with running from /dev/shm, but it's a different use case.

The messaging products used in low-latency trading usually come with a transient message delivery capability and an optional persist-to-storage capability. We have been benchmarking against them in transient mode, so it is a fair comparison. We are also talking about the peer-to-peer UDP products, not the broker-based or TCP-based products, which are miles behind in performance capabilities.

Additionally, we are adding support for persistent messages that also have cluster redundancy. More on that when it is available.

Martin Thompson

Apr 24, 2015, 12:12:56 PM
to mechanica...@googlegroups.com

Martin, if you had to support tcp how would you handle tcp fragmentation ? Meaning how you  make sure you only call the code using aeron to be notified only when the full (or some significant part) is ready for parsing ?

We have no plans for supporting TCP. TCP is a rigid protocol that is baked into kernels. We want to innovate in user space and we can do that better on UDP. There is no way we could get close to the latency we see in Aeron with TCP.
 
p.s.
i am hoping this is not high jacking the thread !

I hope it is not too much of an Aeron thread now for others!

Benedict Elliott Smith

Apr 24, 2015, 1:59:23 PM
to mechanica...@googlegroups.com
This approach is similar to, or at least addresses the same kinds of concerns as, the approach I designed for Apache Cassandra. I've had a blog post sitting around for months that hasn't been published for various reasons; I've hit publish on it today. Competition FTW :)



Francesco Nigro

Apr 24, 2015, 2:02:04 PM
to mechanica...@googlegroups.com
Touché, Mr. T... I'll do my homework and go watch the video... +1 for the comment on SlideShare :P

ymo

Apr 24, 2015, 3:05:34 PM
to mechanica...@googlegroups.com
+1 for the JMH benchmarks )))

It is "unfortunate" that JMH does not come out of the box with support for pinning threads to cores, so that you could have more control over how you oversubscribe.
Also it is unfortunate that JMH does not come with CPU performance counters, for people who have CPU-bound workloads where the injector makes a lot of sense.



Aleksey Shipilev

Apr 24, 2015, 3:13:52 PM
to mechanical-sympathy
On Fri, Apr 24, 2015 at 10:05 PM, ymo <ymol...@gmail.com> wrote:
It is "unfortunate" that JMH does not come out of the box with support for pinning threads to cores so that you could have more control on how you oversubscribe.

JMH does not support pinning, because there is no built-in JDK API we can use. But since JMH does provide the invariant that @Setup methods are called by the worker threads themselves, you can "just" use AffinityLock-s, etc.
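
For example, something like this (just a sketch; AffinityLock here is the one from the OpenHFT Java-Thread-Affinity library, not anything that ships with JMH):

    import net.openhft.affinity.AffinityLock;
    import org.openjdk.jmh.annotations.Level;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;
    import org.openjdk.jmh.annotations.TearDown;

    @State(Scope.Thread)
    public class PinnedState {
        private AffinityLock lock;

        @Setup(Level.Trial)
        public void pin() {
            lock = AffinityLock.acquireLock(); // runs on the worker thread itself, so this pins that thread
        }

        @TearDown(Level.Trial)
        public void unpin() {
            lock.release();
        }
    }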
 
Also it is unfortunate that JMH does not come with cpu performance counters for people that have cpu bound workloads where the injector makes a lot of sense.

Come again? There are: -prof perf,  -prof perfasm, -prof perfnorm, (heck, even) -prof xperfasm.

-Aleksey
 


ymo

Apr 24, 2015, 3:34:25 PM
to mechanica...@googlegroups.com
OK, I stand corrected... I used JMH a while ago and missed all the recent improvements. Thanks a lot, Aleksey ))) Maybe I missed it, but does it make sense to enable the perf counters per method?



Aleksey Shipilev

Apr 24, 2015, 3:54:58 PM
to mechanical-sympathy
Not sure what it means to have perf counters per method. perfasm aggregates counters per method. But I tend to think that once you need that, you have to employ a full-fledged profiler, like Solaris Studio Performance Analyzer or VTune or something else.

-Aleksey.


ymo

Apr 24, 2015, 5:15:26 PM
to mechanica...@googlegroups.com
OK, I understand if it is not supported, but I would think that people would want to put an @Perf annotation on the actual method, with the name of the particular counter they are interested in, and get an aggregate that does not include the warmup time. That would make more sense to me. But maybe I don't have the whole picture on the tool and the target audience.



Greg Young

Apr 24, 2015, 5:50:45 PM
to mechanica...@googlegroups.com

Out of curiosity, how much of the want for such an attribute is due to the lack of a REPL?


Richard Warburton

Apr 24, 2015, 5:59:48 PM
to mechanica...@googlegroups.com
Hi,

This is what I hate about slides taken on their own and not supporting a presentation. I think Slide Share is so wrong :-) A presentation is a presentation and a document is a document.

"Persistent" in this sense is the functional programming definition and not the storage to disk.

A FP persistent data structure does not mutate under the reader. For the period of time they are valid then they are persistent from the FP perspective.

Some people use the term "append-only" here.

regards,

  Richard Warburton

Aleksey Shipilev

Apr 24, 2015, 7:01:42 PM
to mechanica...@googlegroups.com
On 04/25/2015 12:15 AM, ymo wrote:
> Ok i understand if it is not supported but i would think that people
> want to put the @Perf annotation on the actual method with the name of
> the particular counter they are interested.

The obsession with annotations puzzles me.

Have you seen some other tools that are successful with "profile me"
annotations? For example, how would those tools handle the inlined
methods, especially when callee code gets mixed with the caller code?

One could argue that an annotation could be helpful to contrast out the
counters belonging to a particular method. But, if the target method is
hot, it will show up in (whatever is your) HWC profiler anyway. If the
method is cold, then the hardware counter sampling is heavily biased
towards hot methods, and the target method only gets a few samples --
that is not enough for any accurate measurement.

Also, most "hardware counters" count the processor-wide events, not the
"number of events caused by this instruction". Contrasting out how much
exactly a particular counter changed within a particular method requires
tricky guesstimation.

This is why "perfasm" is only showing the relative distribution of the
events over the code. If you want (why?) the absolute values for
hardware counters, even with the caveats described above, take a good
hardware counter profiler (e.g. Solaris Studio Performance Analyzer).

You can measure how many HW events happen per benchmark op with some
good precision, because you can "simply" divide a large number of HW
events over the large number of benchmark ops. That's what "perfnorm" does.

> And get an aggregate that does not include the warmup time. That
> would make more sense to me. But maybe i don't have the whole picture
> on the tool and the target audience.

Well, -prof perf/perfasm/perfnorm do not include warmup time data, if
perf supports delaying and/or incremental reporting. That's the reason
why these things are integrated with the harness itself -- it can tell
the bounds of the warmup and measurement phases to the profiler.

-Aleksey


Greg Wilkins

Apr 24, 2015, 7:14:29 PM
to mechanica...@googlegroups.com


On Saturday, 25 April 2015 00:42:39 UTC+10, ymo wrote:
I kind of don't understand the JMH benchmark altogether. Specially about the part where you have .threads(2000) ?

Well, I've not fully got my head around the JMH model, so this could be wrong... but the default for threads is the number of cores, and as each thread simulates a TestConnection, I wanted a test that had a lot more simulated connections than that - hence the 2000.

 
For the sake of sanity here is what i do :
1) run one producer on one core
2) run all consumers on other cores

But that is precisely the situation that my scheduler is trying to avoid.  After the producer has parsed an HTTP/2 frame its cache is hot with all the request's details, so it is that core that I want to consume the request - not another one.    I can use another core to continue parsing any input that is still in the buffer... OK, that will also have the same cache issue, but probably only for a single cache line or a bit of look-ahead in the buffer - not an entire complex request object and all the things that hang off it.

 
3) make sure i am always on the same socket
4) make sure the kernel is not scheduling anything on those cores

It is a benchmark after all but at least it removes all the other variables from the equation to only test how new consumers are processing your requests. makes sense ?

Sure, benchmarks are run with minimal other load on the system.   I'm not seeing any significant variation in the runs, so I do not believe they are being affected by other processes.



 

Greg Wilkins

Apr 24, 2015, 8:16:58 PM
to mechanica...@googlegroups.com


On Saturday, 25 April 2015 00:37:33 UTC+10, Martin Thompson wrote:
Aeron .... This thread copies the data into data structures that are append only and can be ready by multiple threads without any locks or full fences.

So I think this is a different problem from the one I'm trying to solve with EWYK.   Martin is solving the problem of getting data into a system that needs to consume it multiple times, hence handing it over to other threads in such read-many, lock-free ways is key.

But for the HTTP server case, we don't need the data to be consumed multiple times by different threads - at least not in the way a traditional servlet server is set up.    Once parsed, we will typically handle the request once and only once - dispatching it to the servlet container with a synchronous thread to generate a response.

The theory that I'm putting forward is that in such cases it is better to get the same thread to do the processing, rather than use an efficient hand-over to another thread.   Instead, hand over to another thread to continue producing.


Martin Thompson

Apr 25, 2015, 5:51:00 AM
to mechanica...@googlegroups.com


On Friday, 24 April 2015 22:59:48 UTC+1, Richard Warburton wrote:
Hi,

This is what I hate about slides taken on their own and not supporting a presentation. I think Slide Share is so wrong :-) A presentation is a presentation and a document is a document.

"Persistent" in this sense is the functional programming definition and not the storage to disk.

A FP persistent data structure does not mutate under the reader. For the period of time they are valid then they are persistent from the FP perspective.

Some people use the term "append-only" here.

True, some people say append-only. It is one means of achieving a persistent data structure. The other major technique is path-copy. "Persistent" is the abstract term, like List is in Java, with append-only or path-copy as implementations, just as a List can be array-backed or linked nodes.

The key to persistence in this sense is immutability from the reader's perspective. Immutability makes things much easier to reason about, and if you consider a reasonable time/space window then quite powerful things can be built that also afford great performance. For example, append-only can play very well with hardware prefetchers.

Martin...
 

Benedict Elliott Smith

Apr 27, 2015, 9:45:54 AM
to mechanica...@googlegroups.com
The theory that I'm putting forward is that in such cases it is better to get the same thread to do the processing rather than use and efficient hand over to another thread.   Instead hand over to another thread to continue producing.

I suspect the improvement you are seeing isn't based on which thread does the processing, but rather comes down to efficiently saturating the available CPUs. If serving the servlet takes less time than the cost of a LockSupport.unpark() call multiplied by the number of cores, then you never fully utilise all of your cores with a single network consumer. In your design, this cost is steadily spread out across all of the cores, so that you saturate them more rapidly. This is a similar optimisation to that delivered by the injector, but the injector tries to take it one step further and eliminate some of these calls entirely.

Greg Wilkins

Apr 27, 2015, 7:36:45 PM
to mechanica...@googlegroups.com

On 27 April 2015 at 23:45, Benedict Elliott Smith <b.ellio...@gmail.com> wrote:
The theory that I'm putting forward is that in such cases it is better to get the same thread to do the processing rather than use and efficient hand over to another thread.   Instead hand over to another thread to continue producing.

I suspect the improvement you are seeing isn't based on which thread does the processing, but down to efficiently saturating the available CPUs. If serving the servlet takes less time than a call to LockSupport.unpark() * num cores, then you never fully utilise all of your cores with a single network consumer. In your design, this cost is steadily spread out across all of the cores, so that you more rapidly saturate them. This is a similar optimisation to that delivered by the injector, but the injector tries to take it one step further and eliminate some of these calls entirely.


Benedict,

the reason I suspect that it is the cache misses that cause the slowdown is that we have seen exactly that problem in a previous iteration of Jetty, when we moved a dispatch from before parsing to between parsing and handling for HTTP/1.   We then analysed that slowdown and saw that much/most of it was a result of cache misses.

of course that analysis may also be wrong :)

cheers



--
Greg Wilkins <gr...@intalio.com>  @  Webtide - an Intalio subsidiary
http://eclipse.org/jetty HTTP, SPDY, Websocket server and client that scales
http://www.webtide.com  advice and support for jetty and cometd.

Benedict Elliott Smith

Apr 27, 2015, 8:01:28 PM
to mechanica...@googlegroups.com
That analysis looks to be independent (and quite old)? The link to your new post has expired, but my recollection was that you saw a many-multiples (say, 6x) improvement in performance. Since your IPC was ~0.75 at the time, this would require improving it to 4 through the elimination of just 8% of L1 dcache misses and 5% of LLC misses. An IPC of 4 in a framework like that would be pretty impressive by itself, but the larger reduction, from 14% and 11% respectively, only improved it by 50%, from 0.5 to 0.75. Admittedly there are 20% misses from the L1 icache on the table, but I'm not convinced this would be helped significantly by the strategy you outline, since the body of code being executed by any thread/core at any moment is larger, not smaller, under this scheme.


Greg Wilkins

Apr 27, 2015, 9:32:33 PM
to mechanica...@googlegroups.com

Benedict,

Oh, for sure the 8x improvement in performance was mostly due to keeping the CPUs busier.  That's why I also showed the normalised throughput vs load.   Non-busy CPUs can be used for other tasks (if there are any), so this may not hurt total throughput, but the normalised throughput shows that more CPU time is being used per task.    My theory is that part of that extra time is the dispatch mechanism itself, and that part of it is also the execution with cold caches.

I've published the blog now (nobody said I was totally dreaming... so publish and be damned). Feedback is still welcome, and I'll see if I can follow up with some better latency measurements.

    https://webtide.com/eat-what-you-kill/

cheers



Andy Smith

Apr 28, 2015, 7:43:09 AM
to mechanica...@googlegroups.com
I think 'append-only' might be the more appropriate term. The persistence property is a superset of immutability; i.e. when a persistent structure is 'modified', a new version of the structure is created with the modification applied, but the key point is that the original remains unmodified and immutable before, during, and after the modification - i.e. the original version 'persists' after the operation. I don't think the Aeron structures possess (or require!) this property.

I think it makes sense to describe the Aeron log structures as an append-only collection of logically immutable values, but describing them as 'persistent' might confuse a few FP weenies :-)

Cheers,

A.









 

Martin Thompson

Apr 28, 2015, 8:58:59 AM
to mechanica...@googlegroups.com
On 28 April 2015 at 12:43, Andy Smith <andyr...@gmail.com> wrote:


I think 'append-only' might be the more appropriate term. The persistence property is a superset of immutability; i.e. when a persistent structure is 'modified', a new version of the structure is created with the modification applied, but the key point is that the original remains unmodified + immutable before/during+after the modification - i.e. the original version 'persists' after the operation. I don't think the Aeron structures posses ( or require!) this property.

I think it makes sense to describe the aeron log structures as an append-only collection of logically immutable values, but describing them as 'persistent' might confuse a few FP weenies :-)

It feels like append-only would help clarity all round!
 