the real latency performance killer


Robert Engels

unread,
Apr 17, 2014, 12:46:26 PM4/17/14
to mechanica...@googlegroups.com
I was referred to this group by a colleague, and the participants are certainly more knowledgeable than I am, but I'd like to throw out my two cents anyway.

As a recent blog post of mine showed, Java easily outperforms C++ in real-world tests, but even this test is flawed...

The problem is that even though this is a real-world test, doing "real" work, it is still essentially a micro-benchmark. Why? Because the entire working set stays resident in the CPU caches.

Which brings me to the heart of the problem... Here are the memory access times on a typical modern processor:

Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
remote L3 CACHE ~100-300 cycles
Local DRAM ~60 ns
Remote DRAM ~100 ns
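
For anyone who wants to see these effects from Java, here is a rough pointer-chasing sketch (mine, purely illustrative; the array size and step count are arbitrary). Every load depends on the previous one and the walk order is random, so the prefetchers can't hide anything, and the average cost per load tracks whichever level of the hierarchy the array fits into - shrink SIZE and the number drops toward the L1/L2 figures above.

import java.util.Random;

// Illustrative only: average time per dependent load approximates the latency of
// whatever cache level the array fits into (shrink SIZE to land in L1/L2/L3).
public class LatencyChase {
    public static void main(String[] args) {
        final int SIZE = 16 * 1024 * 1024;        // 16M ints = 64 MB, well past L3
        int[] next = new int[SIZE];
        for (int i = 0; i < SIZE; i++) next[i] = i;
        Random rnd = new Random(42);
        // Sattolo's algorithm: builds a single random cycle over all indices
        for (int i = SIZE - 1; i > 0; i--) {
            int j = rnd.nextInt(i);
            int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }
        final long steps = 50_000_000L;
        int idx = 0;
        long start = System.nanoTime();
        for (long s = 0; s < steps; s++) {
            idx = next[idx];                       // serialized, cache-missing load
        }
        long ns = System.nanoTime() - start;
        System.out.printf("avg %.1f ns per dependent load (idx=%d)%n",
                (double) ns / steps, idx);
    }
}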


So with my "real-world" test, the heart of the code path is always in level 1 cache, with the predicative loading of the cache when the message object is retrieved.

Now, compare this with a true real-world application, with gigabytes of heap. Most modern processors have about 20 MB of shared level 3 cache, which is a small fraction of the memory in use. So when the garbage collector is moving things around, and/or background "house-keeping" tasks are doing their work, they blow out the CPU caches (even the non-shared L2 cache is destroyed by a compacting garbage collector). Even isolated CPUs don't help with the latter.

So when your low-frequency but low-latency code runs (say, sending an order in response to some market event), it is going to run 5x (or more if NUMA is involved) slower than in the micro-benchmark case, due to the non-cached main-memory accesses.

How do we fix this? Two ways.

With CPU support for "non-cached reads and writes", a thread (or possibly a class/object) can be marked as "background", and then memory accesses by this thread/class do not go through the cache, hopefully preserving the L2/L3 cache for the "important" threads.

Similarly, an object/class marked "important" is a clue to the garbage collector not to move this object around if at all possible. This can sort of be solved now with off-heap memory structures, but they're a pain (at least in the current incarnation).
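
For what it's worth, here is a bare-bones sketch of that off-heap workaround (my own illustration; the record layout, field names and offsets are made up): a fixed-layout "order" record kept in a direct ByteBuffer, which the collector neither scans nor relocates - at the cost of manual layout, which is exactly the pain referred to.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: a fixed-layout "order" record stored off-heap in a direct
// buffer. The GC neither scans nor relocates this memory, so its cache lines
// stay put; the price is manual layout and no type safety.
public class OffHeapOrder {
    private static final int PRICE_OFFSET = 0;   // long  (8 bytes)
    private static final int QTY_OFFSET   = 8;   // int   (4 bytes)
    private static final int SIDE_OFFSET  = 12;  // byte  (1 = buy, 0 = sell)
    private static final int RECORD_SIZE  = 16;  // padded to 16 bytes

    private final ByteBuffer buf;

    public OffHeapOrder(int maxOrders) {
        buf = ByteBuffer.allocateDirect(maxOrders * RECORD_SIZE)
                        .order(ByteOrder.nativeOrder());
    }

    public void write(int slot, long price, int qty, boolean buy) {
        int base = slot * RECORD_SIZE;
        buf.putLong(base + PRICE_OFFSET, price);
        buf.putInt(base + QTY_OFFSET, qty);
        buf.put(base + SIDE_OFFSET, (byte) (buy ? 1 : 0));
    }

    public long price(int slot) { return buf.getLong(slot * RECORD_SIZE + PRICE_OFFSET); }
    public int  qty(int slot)   { return buf.getInt(slot * RECORD_SIZE + QTY_OFFSET); }

    public static void main(String[] args) {
        OffHeapOrder orders = new OffHeapOrder(1024);
        orders.write(0, 101_25L, 500, true);       // price in ticks, qty, buy side
        System.out.println(orders.price(0) + " x " + orders.qty(0));
    }
}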

Without something similar to the above, I just don't think low-frequency, low-latency operation is possible.




Robert Engels

unread,
Apr 17, 2014, 12:59:16 PM4/17/14
to mechanica...@googlegroups.com
Also, this paper has some really good research on the problem.

Ross Bencina

unread,
Apr 17, 2014, 1:38:59 PM4/17/14
to mechanica...@googlegroups.com
On 18/04/2014 2:46 AM, Robert Engels wrote:
> How do we fix this? 2 ways.
>
> With CPU support for "non-cached reads and writes", a thread or
> (possibly a class/object) can be marked as "background", and then memory
> access by this thread/class do not go through the cache, hopefully
> preserving the L2/L3 cache for the "important" threads.
>
> Similarly, an object/class marked "important" is a clue to the garbage
> collector to not move this object around if at all possible. This can
> sort of be solved now with off-heap memory structures, but they're are
> pain (at least in the current incarnation).
>
> Without something similar to the above, I just don't think low-frequency
> and low-latency is possible.

Another trick to add to your bag is to partition your memory layout
based on cache associativity sets. These guys got some performance
improvement in their real-time memory allocator:

http://www.cister.isep.ipp.pt/ecrts11/prog/CAMAaPredictableCacheAwareMemoryAllocator.pdf

Key quote:

"Store descriptors only in memory locations mapped to a known,
bounded range of cache sets!"
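
Purely to illustrate what "mapped to a bounded range of cache sets" means, here is a little arithmetic sketch (mine, with made-up cache parameters, not the paper's): for a physically indexed cache, the set index is a simple function of the address, so an allocator can deliberately confine its own descriptors to a small band of sets and leave the rest of the cache alone.

// Illustrative arithmetic only (cache parameters are invented, not from the paper):
// for a physically indexed cache, the set index is a simple function of the address,
// so an allocator can confine its descriptors to a bounded range of sets.
public class CacheSets {
    static final int LINE_SIZE = 64;          // bytes per cache line
    static final int NUM_SETS  = 8192;        // e.g. 16 MB, 32-way: 16 MB / (64 * 32)

    static int setIndex(long physicalAddress) {
        return (int) ((physicalAddress / LINE_SIZE) % NUM_SETS);
    }

    public static void main(String[] args) {
        long descriptorBase = 0x10_0000L;     // hypothetical allocator metadata region
        for (int i = 0; i < 4; i++) {
            long addr = descriptorBase + (long) i * LINE_SIZE;
            System.out.printf("addr 0x%x -> set %d%n", addr, setIndex(addr));
        }
    }
}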

Ross.

Vitaly Davidovich

unread,
Apr 17, 2014, 1:41:14 PM4/17/14
to mechanica...@googlegroups.com

There are only a few cases where Java will beat C++. Now, as you state in your blog, it's easier and faster to write a better-performing Java version of some algorithm, but there are many more optimization opportunities exposed in C++ than in Java. So a skilled C++ developer who is sympathetic to the machine will most likely outpace Java. Not to speak of the amount of memory both servers will take. Also, Java is commonly plenty fast, true - the issue most people fight against in low latency is the unpredictability of GC. So both groups end up trying to avoid allocations: the Java guys because of GC, the C++ guys because of malloc/new.

The cache issues you mention, and generally the discrepancy between core and memory speed, are the primary reason Java needs facilities to shrink footprint and stop chasing pointers.
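
To make that concrete, here is a small sketch (mine, illustrative only; the type and field names are invented) of what shrinking footprint and killing pointer chasing tends to look like in today's Java: replace an array of little objects with parallel primitive arrays, so the hot loop walks contiguous, prefetch-friendly memory instead of following a reference per element.

// Illustrative only: the same "trade" data held two ways. The object-graph form
// costs a header and a reference per element and scatters fields across the heap;
// the packed form keeps the hot loop on contiguous, prefetch-friendly memory.
public class FlatVsPointerChasing {
    // Pointer-chasing layout: Trade[] is an array of references.
    static final class Trade {
        long price;
        int qty;
        Trade(long price, int qty) { this.price = price; this.qty = qty; }
    }

    // Flattened layout: structure-of-arrays, no per-element objects.
    static final class Trades {
        final long[] price;
        final int[] qty;
        Trades(int n) { price = new long[n]; qty = new int[n]; }
    }

    static long notionalObjects(Trade[] trades) {
        long sum = 0;
        for (Trade t : trades) sum += t.price * t.qty;   // one dependent load per element
        return sum;
    }

    static long notionalFlat(Trades t) {
        long sum = 0;
        for (int i = 0; i < t.price.length; i++) sum += t.price[i] * t.qty[i]; // sequential scan
        return sum;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        Trade[] objects = new Trade[n];
        Trades flat = new Trades(n);
        for (int i = 0; i < n; i++) {
            objects[i] = new Trade(100 + i, 1 + (i % 10));
            flat.price[i] = 100 + i;
            flat.qty[i] = 1 + (i % 10);
        }
        System.out.println(notionalObjects(objects) + " " + notionalFlat(flat));
    }
}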

Sent from my phone


Robert Engels

unread,
Apr 17, 2014, 2:00:03 PM4/17/14
to mechanica...@googlegroups.com
I disagree somewhat, but it really depends on the use-case. In an enterprise-type application with GBs of live data, without some macro-level support for cache usage, you're going to have a problem.

This is why really low-latency systems (even with Java) use separate processes in order to better isolate cache usage. But even here, unless your complete data set fits in the unshared L2 cache, you're going to have a problem, as the level 3 cache is going to be blown by other "enterprise" processes running on the other cores. You can isolate these processes onto different machines and pay the network penalty, or you need to do not just core isolation but CPU isolation, and that gets expensive real quick...

Vitaly Davidovich

unread,
Apr 17, 2014, 2:12:34 PM4/17/14
to mechanica...@googlegroups.com

I don't see how separate processes help to isolate cache usage. What some folks do is flat out divvy up the machine: interrupts are masked out to run on a subset of cores, some processes are then affinitized to run on other cores, etc. Maybe that's what you meant, and it is a headache to maintain these setups.

Not sure which part you disagree with, but I'm sure we all agree that java could use a diet for data representation.  Irrespective of other things, it'd be nice if more stuff fit in cache to begin with before we start worrying about cache misses and the like.

Sent from my phone


Martin Thompson

unread,
Apr 17, 2014, 2:15:13 PM4/17/14
to mechanica...@googlegroups.com
When it comes to memory access performance, 4 major things matter:
  1. Volume of data you are shifting, but this is becoming less of an issue with every generation as bandwidth keeps taking huge strides forward.
  2. Locality: If you are in the same cache line or page then you benefit from warm data caches, TLB caches, and DDR sense amplifier row buffers.
  3. Predictable access patterns mean the prefetchers can hide the latency by prefetching the data in time for your instructions needing it. Pointer chasing is bad (see the little sketch below).
  4. Non-uniform memory access (NUMA) effects. When crossing interconnects between sockets you need to add 20ns for each one-way hop, and depending on your CPU version you may not get prefetch support and be subject to unexpected writebacks. You need to get used to the likes of numactl and cgroups to ensure your processes run and access memory where you expect.
If your code shows no sympathy to the memory subsystems then you can pay a big performance price.
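
A tiny way to feel points 2 and 3 for yourself (my own sketch; the matrix size is arbitrary): sum the same matrix row-by-row and then column-by-column. The arithmetic is identical, only the access pattern differs, and the column-order walk typically loses by a large factor because it defeats both locality and the prefetchers.

// Illustrative sketch: identical work, different access order. Row-major order
// walks each row array end to end and is prefetcher-friendly; column-major
// order touches a different row array (different cache lines/pages) every step.
public class TraversalOrder {
    public static void main(String[] args) {
        final int N = 4096;
        int[][] m = new int[N][N];

        long t0 = System.nanoTime();
        long rowSum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                rowSum += m[i][j];               // sequential within each row
        long rowNs = System.nanoTime() - t0;

        long t1 = System.nanoTime();
        long colSum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                colSum += m[i][j];               // strides across rows, poor locality
        long colNs = System.nanoTime() - t1;

        System.out.printf("row-major %d ms, column-major %d ms (sums %d/%d)%n",
                rowNs / 1_000_000, colNs / 1_000_000, rowSum, colSum);
    }
}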



Robert Engels

unread,
Apr 17, 2014, 2:23:34 PM4/17/14
to mechanica...@googlegroups.com
It's mainly for Java, but it can apply to large C applications as well with certain allocators. By splitting the large process into multiple smaller ones, then isolating them, heap access by some tasks can't affect the cache of the other, smaller "important" tasks.

For instance, the "order engine" could be an isolated JVM, with a very small heap, doing little garbage generation (or certainly a small portion of the heap). You are effectively isolating it's heap usage to fit within the L2 of the isolated core, so even if the garbage collector compacts the heap, it still resides in the L2.

I disagree with the general statement that you can write faster code in C than in Java... when you add other real-world constraints like time to market, architectural flexibility, correctness, etc.


Robert Engels

unread,
Apr 17, 2014, 2:27:48 PM4/17/14
to mechanica...@googlegroups.com
Agreed. I also think that the JVM can provide real gains here with dynamic memory analysis, rather than static compiler optimizations (and manual laying out of shared structures / cache lines, etc. by the developer).

People knock the higher level abstractions in Java, and continually want lower direct access to the hardware, because people think they are smarter than the machine. People will eventually figure out that using higher level abstractions will lead to better performance when they let go of their egos.

Robert Engels

unread,
Apr 17, 2014, 2:35:19 PM4/17/14
to mechanica...@googlegroups.com
Also, if the hardware engineers could figure out a way to make main memory access as fast as L1 cache, all these problems go away...



Vitaly Davidovich

unread,
Apr 17, 2014, 2:40:38 PM4/17/14
to mechanica...@googlegroups.com

It's not that people think they're smarter than machines (although, that's true as well - people are the ones designing them, machines are just much quicker than humans).  The issue is that as you go lower in the stack, things become more and more generalized.  As a developer of some system, you typically know a lot more about data and its flow than anything lower than you.  In those cases, you want more control because, well, you happen to know more about your usecase.  The machine can pick up some patterns automatically (e.g. branch prediction, prefetch, etc), but they're going to be general "obvious" patterns. 

The idea of having the JVM do dynamic layout based on some CPU feedback has been brought up before, but this is a hard problem. What happens if the workload changes? Are you going to re-layout everything? Leave it be? How is this data going to be collected, and for which memory accesses? What is the perf implication, CPU + memory? What happens if the profile collected is not indicative of the most optimal layout? This is already an issue with JIT compilation, as its dynamic nature is a blessing and a curse.

It's nice to have default tuning done for you, but for high perf scenarios, there needs to be manual control exposed.

Sent from my phone


Kirk Pepperdine

unread,
Apr 17, 2014, 2:53:34 PM4/17/14
to mechanica...@googlegroups.com

On Apr 17, 2014, at 8:23 PM, Robert Engels <ren...@ix.netcom.com> wrote:

> It's mainly for Java, but can apply large C applications as well with certain allocators. By splitting the large process into multiple smaller ones, then isolating them, heap access by some tasks can't affect the cache of the other smaller "important" tasks.

I don't see how this will help. Threads within a process or in different processes will have the same overall effect on the cores and associated cache. Threads and processes are abstractions, or ways of organizing things for humans. By the time they hit a core...

Regards,
Kirk

Vitaly Davidovich

unread,
Apr 17, 2014, 2:59:51 PM4/17/14
to mechanica...@googlegroups.com
Fit into L2? L2 is like 512 KB - 1 MB.

The time to market argument in favor of java vs c++ is only relevant, I think, if you're constantly starting from scratch.  Architecture flexibility is really an artifact of engineers working on the project; I've seen horror shows in java as well, this is not a language issue.  Correctness -- hmm, a bit hard to say.  There's certainly a class of errors you won't see in java, but don't know if that's what you mean by correctness.  The actual business logic correctness is, again, in the hands of developers on the project.  c++ is a tricky language as a whole, but one doesn't have to use the entire language.  Also, compiler support and static analysis tools are getting better there as well.

Having said all that, I'm a huge fan of the JVM; I think it's an excellent piece of engineering, and given all that it provides, the speed code can run at is pretty impressive.  And in a lot of cases, it's fast *enough*.  However, in domains where ultimate speed/efficiency == $$ (whether via direct means or indirect, such as requiring fewer machines), it can pay to squeeze as much as possible out of a machine.



Robert Engels

unread,
Apr 17, 2014, 3:02:55 PM4/17/14
to mechanica...@googlegroups.com
But as you start making those decisions lower and lower, your software becomes very rigid, and unable to respond to changes in architecture, usage, etc. So when you factor in all of the other real-world concerns, I just don't buy that you're going to consistently out-perform a generalized, easy-to-change system (where there are hundreds of developers continually improving the generalized internals).

As far as the machines only being faster, not smarter, I'm not sure I buy that either. Take a 30 variable multiple regression on a large data set. Yes, the human designed the system, but he could never solve it without the machine... if you're too slow to matter, you might as well be dumb too... (and then there is all of the machine learning, and genetic algorithms stuff which is a whole other topic ...)


Vitaly Davidovich

unread,
Apr 17, 2014, 3:03:17 PM4/17/14
to mechanica...@googlegroups.com
I think Robert was implying affinitizing the process(es) to run on only certain (non-overlapping) cpus; that's the "then isolating them" part.  At least that's how I understood it, in which case, there are cases where such a scenario helps.  With JVM processes, this is somewhat of an issue because now each of these processes incurs the same JVM overhead repeatedly, thus reducing the machine's capacity.



Robert Engels

unread,
Apr 17, 2014, 3:07:20 PM4/17/14
to mechanica...@googlegroups.com
Not true. If you isolate the core, and run the smaller JVM with a small memory footprint on a single core (and nothing else on that core), then you have the L1 and L2 isolated from all other activity, and any compaction still results in the object being in the L2 cache.





Martin Thompson

unread,
Apr 17, 2014, 3:11:50 PM4/17/14
to mechanica...@googlegroups.com
This is a pipe dream. If you have studied modern hardware you will have realised that the hierarchy will get deeper and that we are moving to core-local and tiled memory.

Robert Engels

unread,
Apr 17, 2014, 3:21:53 PM4/17/14
to mechanica...@googlegroups.com
I am certainly not a hardware guru by any means, but I recall people thinking 14-nanometer CPUs were a pipe dream too... and now we're talking 10 nm...

Martin Thompson

unread,
Apr 17, 2014, 3:36:01 PM4/17/14
to mechanica...@googlegroups.com
I talk to hardware folk and cache hierarchies are getting deeper and innovation is looking at local memory to CPUs rather than huge shared memories.

We can always be surprised but there is nothing in the pipeline that suggests we are going to get large memory spaces at the 3-4 cycle response times of L1 caches.





Robert Engels

unread,
Apr 17, 2014, 3:42:50 PM4/17/14
to mechanica...@googlegroups.com
I agree that that is the likely direction (my original comment was intended as a joke), but that makes even more of a case for higher-level abstractions to take advantage of huge (> 1024 core) machines with larger local caches.

With higher abstractions it becomes much easier to break processes apart and transparently integrate when needed (sometimes with no developer effort, everything is RMI, etc.), and let the OS/JVM figure out what to run where.

Trying to do this manually with massively parallel machines is very difficult.

Martin Thompson

unread,
Apr 17, 2014, 3:49:04 PM4/17/14
to mechanica...@googlegroups.com
I think some of the really interesting work on high-level abstractions in this area is on "Cache Oblivious Algorithms".

Here is a nice blog on potential speedup.
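
To make the idea concrete, here is a bare-bones sketch (mine, not from the blog): a recursive, cache-oblivious matrix transpose. It keeps halving the larger dimension until the sub-block is tiny, so the working set eventually fits whatever L1/L2/L3 the machine happens to have, without the code ever being told the cache sizes.

// Illustrative cache-oblivious transpose: recurse on halves until the sub-block
// is small; sub-blocks eventually fit in L1/L2/L3 of any machine without the
// code knowing their sizes, which is the point of the technique.
public class CacheObliviousTranspose {
    private static final int CUTOFF = 32;   // small base case, tuned loosely

    // Transpose src[r0..r0+rows)[c0..c0+cols) into dst, swapping indices.
    static void transpose(double[][] src, double[][] dst,
                          int r0, int c0, int rows, int cols) {
        if (rows <= CUTOFF && cols <= CUTOFF) {
            for (int i = r0; i < r0 + rows; i++)
                for (int j = c0; j < c0 + cols; j++)
                    dst[j][i] = src[i][j];
        } else if (rows >= cols) {
            int half = rows / 2;
            transpose(src, dst, r0, c0, half, cols);
            transpose(src, dst, r0 + half, c0, rows - half, cols);
        } else {
            int half = cols / 2;
            transpose(src, dst, r0, c0, rows, half);
            transpose(src, dst, r0, c0 + half, rows, cols - half);
        }
    }

    public static void main(String[] args) {
        int n = 2048;
        double[][] a = new double[n][n], b = new double[n][n];
        transpose(a, b, 0, 0, n, n);
        System.out.println("done, b[1][0]=" + b[1][0]);
    }
}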

Vitaly Davidovich

unread,
Apr 17, 2014, 3:49:30 PM4/17/14
to mechanica...@googlegroups.com

There's already plenty of abstraction, through all layers, hardware to software.  You basically want something that, given any random app X, the runtime will special-case in a nearly optimal way automatically and on-the-fly; and do it in the most performant way; and do it every time.  That's not going to happen - it'll get you 80%; we need control over the other 20 (or even less, but there remains a need for manual control over the small yet important/hot percentage of the codebase).

Sent from my phone

Robert Engels

unread,
Apr 17, 2014, 3:58:54 PM4/17/14
to mechanica...@googlegroups.com
I disagree. I'll let someone hand-tune (develop) to the exact configuration, hardware, software, etc. I'll write the code in a more generic manner, and I'll take advantage of every hardware generation's improved capability far faster than the other - the other will always be behind in terms of performance because of this... (that being said, some configuration is needed today, as we haven't gotten that far along...)

Today, I can move our application to an IBM POWER7 series, with 5 GHz processors, and double the performance - all without changing a single line of code - not even a recompile.... Even if the machine costs a million dollars, how many developer years are saved...


Robert Engels

unread,
Apr 17, 2014, 3:59:21 PM4/17/14
to mechanica...@googlegroups.com
Interesting, thanks !

Vitaly Davidovich

unread,
Apr 17, 2014, 4:01:57 PM4/17/14
to mechanica...@googlegroups.com
ok :)

As an aside, I hope you realize that clock speed alone has stopped being a principal performance factor for a few generations of processors at this point.

Robert Engels

unread,
Apr 17, 2014, 5:54:00 PM4/17/14
to mechanica...@googlegroups.com
Btw, just saw this from the Power 7 Wikipedia page...

One feature that IBM and DARPA collaborated on is modifying the addressing and page table hardware to support global shared memory space for POWER7 clusters. This enables research scientists to program a cluster as if it were a single system, without using message passing. From a productivity standpoint, this is essential since some scientists are not conversant with MPI or other parallel programming techniques used in clusters.[5]

Gil Tene

unread,
Apr 17, 2014, 8:46:45 PM4/17/14
to mechanica...@googlegroups.com, Robert Engels
Unfortunately, It's Not Not true. ;-)

If you can isolate your small (or large) process (or set of threads) to a separate socket, you are protected from your cache being interfered with by anything not accessing its contents.

But when you isolate a single core within a modern Xeon socket, your L1 and L2 are not isolated, and your noisy in-socket neighbors will still hurt you. The L3 on Xeons is inclusive of L2 and L1. When an LRU L3 line is evicted to make room for a newly read one, associated L2 and/or L1 contents go away with it.

So unless you dedicate an entire socket to your isolated process, your next best bet is to avoid going idle, while keeping your L1 and L2 warm by having your "idle loop" repeatedly access all the stuff you may need, even when you don't need it. This won't prevent neighbor-driven eviction, but it will have a much higher likelihood of pre-recovering from it before you actually miss in the L1/L2 when you care about it.
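
A rough sketch of what that looks like in practice (my own illustration; the sizes, names and event type are made up): a pinned, busy-spinning event loop that re-touches its hot state whenever it has nothing better to do, so neighbor-driven eviction is repaired before the next message shows up on the critical path.

import java.util.concurrent.atomic.AtomicReference;

// Illustrative only: a busy-spin event loop that re-touches its hot state while
// idle. This does not prevent L3-inclusive eviction by noisy neighbors, but it
// pre-recovers the lines into L1/L2 before they are needed on the critical path.
public class WarmSpinLoop implements Runnable {
    private final long[] hotState = new long[8 * 1024];          // ~64 KB working set
    private final AtomicReference<String> inbox = new AtomicReference<>();
    private volatile long sink;                                   // defeat dead-code elimination

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            String event = inbox.getAndSet(null);
            if (event != null) {
                handle(event);                                    // latency-critical path, data already warm
            } else {
                long s = 0;
                for (int i = 0; i < hotState.length; i += 8) {    // one touch per 64-byte line
                    s += hotState[i];
                }
                sink = s;
            }
        }
    }

    private void handle(String event) {
        hotState[event.hashCode() & (hotState.length - 1)]++;     // toy "work"
    }

    public void submit(String event) { inbox.set(event); }

    public static void main(String[] args) throws InterruptedException {
        WarmSpinLoop loop = new WarmSpinLoop();
        Thread t = new Thread(loop, "hot-path");   // in real use, pin this thread to an isolated core
        t.start();
        loop.submit("order-event");
        Thread.sleep(100);
        t.interrupt();
    }
}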

BTW, within Java VMs, you can separate the VM threads (mostly the GC threads) to run on a separate socket, which will keep them from thrashing your cache when they do their background work. This is actually fairly practical even with multiple JVMs and per-core isolation, since you can put all the JVMs GC threads on the "system" socket and keep all your application threads (in all JVMs) in the dedicated socket. I know people who actually do this...

Separately, if you do the "keep my cached stuff warm by accessing or modifying it all the time" thing on your per-thread isolated CPUs, even compaction/relocation of your objects by the GC doesn't hurt much, as your relocated objects will be pre-recovered back into L1/L2 just like they would be if a neighbor process caused eviction through mere L3 pressure.


robert engels

unread,
Apr 17, 2014, 9:45:36 PM4/17/14
to Gil Tene, mechanica...@googlegroups.com
You are very correct... my bad.

We use that exact setup to avoid the GC threads thrashing the cache, but the GC still moves objects around which destroys the cache anyway...

Keeping needed code and cache hot (in an idle loop sense) is not always possible... (or often not easy to do without very ugly code).

To clarify on your first point though, if the other cores all have working sets within their L2 cache size (or less restrictive, their total working sets within L3 - sizeof(L2)), aren't you essentially protecting the L2 and L1 used by the isolated core?

But to bring us back where I started, this is why hardware support for non-cached read/writes with the ability to control which threads/classes use these calls might be helpful.

Your points also seem to suggest that specialized/embedded systems would probably do better with non-inclusive L3 caches - protecting the per core L2 based on the configured loads. Do you know any architectures using such a setup?

Kirk Pepperdine

unread,
Apr 18, 2014, 2:46:35 AM4/18/14
to mechanica...@googlegroups.com
Even with this level of isolation, at some level you'll be sharing, and once you share you'll have to deal with contention.

Gil Tene

unread,
Apr 18, 2014, 2:55:15 AM4/18/14
to mechanica...@googlegroups.com

Follow up answers inline.

On Thursday, April 17, 2014 6:45:36 PM UTC-7, Robert Engels wrote:
You are very correct... my bad.

We use that exact setup to avoid the GC threads thrashing the cache, but the GC still moves objects around which destroys the cache anyway...

Keeping needed code and cache hot (in an idle loop sense) is not always possible... (or often not easy to do without very ugly code).

Yes. It's ugly. But not keeping it warm guarantees it won't last long in the presence of other activity in the same socket.

Luckily, both spatial and temporal locality are alive and well in most applications, and hardware prefetchers are really good at dealing with multi-line access patterns, so missing this stuff back into the L1 is not that big a deal (sub-microsecond hits to get back to being warm, usually). But that's just as true when GC kicked your objects out or moved them around...
 

To clarify on your first point though, if the other cores all have working sets within their L2 cache size (or less restrictive, their total working sets within L3 - sizeof(L2)), aren't you essentially protecting the L2 and L1 used by the isolated core?

Nothing practical (that interacts with the outside world) lives purely within its L2 cache size. At the very least, any network I/O you are doing will be moving in and out of the L3 cache. After ~20MB of network traffic, or any other memory traffic, all your idle (not actively being hit) L2 and L1 contents will have been thrown away, and will generate new L3 cache misses. So if your isolated core is mostly idle or spinning (which is usually the case), and it does not actively access the contents of its L2, any other activity in the socket will cause that cold L2 to get thrown away.
 

But to bring us back where I started, this is why hardware support for non-cached read/writes with the ability to control which threads/classes use these calls might be helpful.

Non-cached reads/writes are usually used for specialized I/O operations... They are so expensive (compared to cached ones) that I highly doubt this will be used for anything real (like "everything this thread does is non-cached"). Remember that non-cached also means non-streaming and non-prefetchable. It also means that each word or byte access is a separate ~200 cycle memory access. Also remember that stack memory is generally indistinguishable from local memory, and that the CPU has a limited set of registers... That all adds up to "threads/classes that are forced to use non-cacheable memory for everything are not useful for much of anything".
 
Your points also seem to suggest that specialized/embedded systems would probably do better with non-inclusive L3 caches - protecting the per core L2 based on the configured loads. Do you know any architectures using such a setup?

There is a good reason for the lowest level cache (aka "LLC"; the one closest to memory) being inclusive in pretty much all multi-socket architectures. When the LLC is inclusive of all closer-to-the-cores caches, coherency traffic is only needed between LLCs, and only for coherency state transitions at the LLC level. The hit and miss rates (in ops/sec, not in %) in LLCs are orders of magnitude smaller than those in L1 (and L2 where it exists). If the LLC was not inclusive, state changes in the inner caches would need to be communicated to all other caches, and the cross-socket coherency traffic volume would grow by a couple orders of magnitude, which simply isn't practical with chip-to-chip interconnects and pin counts.

Martin Thompson

unread,
Apr 18, 2014, 3:17:06 AM4/18/14
to mechanica...@googlegroups.com
On 18 April 2014 07:52, Gil Tene <g...@azulsystems.com> wrote:
Your points also seem to suggest that specialized/embedded systems would probably do better with non-inclusive L3 caches - protecting the per core L2 based on the configured loads. Do you know any architectures using such a setup?

There is a good reason for the lowest level cache (aka "LLC"; the one closest to memory) being inclusive in pretty much all multi-socket architectures. When the LLC is inclusive of all closer-to-the-cores caches, coherency traffic is only needed between LLCs, and only for coherency state transitions at the LLC level. The hit and miss rates (in ops/sec, not in %) in LLCs are orders of magnitude smaller than those in L1 (and L2 where it exists). If the LLC was not inclusive, state changes in the inner caches would need to be communicated to all other caches, and the cross-socket coherency traffic volume would grow by a couple orders of magnitude, which simply isn't practical with chip-to-chip interconnects and pin counts.

Gil, I'm not sure you are correct here. LLC on Linux normally refers to Last Level Cache, not "lowest". AMD's L3 cache is a mostly exclusive victim buffer. Intel are inclusive at L3 while AMD are mostly exclusive.

Gil Tene

unread,
Apr 18, 2014, 3:48:24 AM4/18/14
to mechanica...@googlegroups.com
The thing AMD Bulldozer family processors refer to as L3 is more of a side-cache next to the L2 than a lowest or last level cache. They are of similar size, e.g. 8 MB of L2 split between 4 pairs, and 8 MB of "L3" victim cache. That's a 1x ratio between the total L3 and L2 on a socket, and a 4x ratio between the total L3 and a single L2 (compare that with 10x and 120x ratios on Xeons for the same thing). This is not a "better" or "worse" statement, just a design choice on layering L3 and L2. In both AMD and Xeon right now, L1 is included in the last level caches, and the last level caches are big (multi-MB per core). In the AMD setup, L2 and L3 are both "last level". Any coherency state changes in either are communicated to all other L2s and L3s in the system. Since the AMD L2 is 4-8x as large as the Xeon one, its sheer size acts as a damper (much lower miss rate). And since the L2 and L3 sizes are similar, the overall miss rate difference between them is inherently not that big, and the effect on coherency traffic volumes is not that high (at these sizes, a 2x ratio does not translate to a 2x reduction in miss rate or a 2x ratio in coherency traffic).

In Xeons (Nehalem and above), the L3 is a proper "last level", and is MUCH larger than the L2, making for the sort of hit and miss rate differences that truly affect coherency traffic volumes.

Martin Grajcar

unread,
Apr 18, 2014, 3:56:45 AM4/18/14
to mechanica...@googlegroups.com
On Fri, Apr 18, 2014 at 8:52 AM, Gil Tene <g...@azulsystems.com> wrote:

Follow up answers inline.

On Thursday, April 17, 2014 6:45:36 PM UTC-7, Robert Engels wrote:

To clarify on your first point though, if the other cores all have working sets within their L2 cache size (or less restrictive, their total working sets within L3 - sizeof(L2)), aren't you essentially protecting the L2 and L1 used by the isolated core?

Nothing practical (that interacts with the outside world) lives purely within it's L2 cache size. At the very least, any network i/o you are doing will be moving in and out of the L3 cache. After ~20MB of network traffic, or any other memory traffic, all your idle (not actively being hit) L2 and L1 contents will have been thrown away, and will generate new L3 cache misses. So if your isolated core is mostly idle or spinning (which is usually the case), and it does not actively access the contents of it's L2, any other activity in the socket will cause that cold L2 to get thrown away. 

I wonder if it could make sense for the CPU to optionally protect lines present in any L2 from being replaced in L3? Sort of reserved part of L3 per core. For an example CPU with 4 cores times 256kB L2, it means 1 MB out of 6 MB, surely not a negligible part, but maybe acceptable when you need to minimize latency of the isolated core(s)?
 
But to bring us back where I started, this is why hardware support for non-cached read/writes with the ability to control which threads/classes use these calls might be helpful.

Non-cached read/writes are usually used for specialized i/o operations... They are so expensive (compared to cached ones) that I highly doubt this will be used for anything real (like "everything this thread does is non-cached"). Remember that non-cached also mean non-streaming and non-prefetcheable. It also means that each word or byte access is a separate ~200 cycle memory access. Also remember that stack memory is generally indistinguishable from local memory, and that the CPU has a limited set of registers... That all adds up to "threads/classes that are forced to use non-cacheable memory for everything are not useful for much of anything"

Could there be something like semi-cacheable? Allow the thread to use only some fraction of cache, so that the operation are not too expensive, and also can't wipe out the whole cache.

Martin Thompson

unread,
Apr 18, 2014, 4:12:09 AM4/18/14
to mechanica...@googlegroups.com
Hi Gil, I'm not disagreeing with your point. Just being a pedant before the morning coffee kicks in :-)

To support your point from another angle. One of the Intel folk gave me a good description of the L2 cache. They said don't think of L2 so much as a cache, think of it more like a coalescing buffer for queuing updates between L1 and L3. Without it, all the L1 caches would be banging too hard on the L3.



Rüdiger Möller

unread,
Apr 18, 2014, 9:08:32 AM4/18/14
to mechanica...@googlegroups.com
On Thursday, April 17, 2014 at 20:27:48 UTC+2, Robert Engels wrote:
Agreed. I also think that the JVM can provide real gains here with dynamic memory analysis, rather than static compiler optimizations (and manually laying out of shared structures / cache lines, etc. by the developer).

People knock the higher level abstractions in Java, and continually want lower direct access to the hardware, because people think they are smarter than the machine. People will eventually figure out that using higher level abstractions will lead to better performance when they let go of their egos.

 
Depends on the market you are in. In a competitive environment, your "higher level abstractions" Java app will always be behind by something like 20% (at best) compared to a manually optimized "to-the-metal" application. Actually, hardware innovation cycles are not that fast. Typically only a small percentage of your overall codebase is tweaked to perform on current hardware, so you might overestimate the cost of mechanical sympathy.

robert engels

unread,
Apr 18, 2014, 9:20:29 AM4/18/14
to mechanica...@googlegroups.com
On the non-cached reads/writes, I think I am being a bit misunderstood. What I am proposing is basically to be able to treat a core as a non-SMP core. Obviously the locking and memory fencing needs to be more sophisticated, but I think you can see where I am headed.

Also, one quick question in regards to your statement on network I/O. Wouldn't the driver/card do the I/O directly to memory, bypassing the cache? And then wouldn't the driver move the memory directly to the mapped buffer space before reading, thereby reusing the same address lines and never affecting the cache? It would seem that just remapping the buffer, and destroying the cache in the process, would be too expensive overall? Otherwise it would seem that any high-volume network application is never using the L3 cache anyway...

Sent from my iPad


Robert Engels

unread,
Apr 18, 2014, 9:55:57 AM4/18/14
to mechanica...@googlegroups.com
Did you read my blog post that started this? A real world example of why this is just not the case... And I'm certain this is more common than people think.

Another great example is Linux itself. Look at the performance gains that have been made over the years. But it's only been possible with massive numbers of developers and massive numbers of bugs.

Almost all of the big gains are from algorithmic changes, which are often hard to get into the tree... just to try and ensure correctness, limit possible cross-module effects, and so that the other developers can understand the scope of the changes.

Then you have frameworks like the Disruptor, and when it does real work it is only 1% faster than the standard methods. And now they use Azul Zing anyway, because writing GC-less Java isn't Java, and so you might as well write in C.

Martin Thompson

unread,
Apr 18, 2014, 10:23:29 AM4/18/14
to mechanica...@googlegroups.com
So is your point that the efforts of this group are futile? Is your quest to prove us all wrong and that we should not care for what is under all the abstractions? Without a really compelling case this sort of approach will come across to others as trolling whatever the underlying motivation.

The reason Mechanical Sympathy started in motor racing was because many drivers had reached the point of blindly accepting abstractions and this resulted in not only reduced performance but also greatly increased risk of harm.

We are all a product of our own experience. In my experience, showing mechanical sympathy can result not just in enormous performance gains (3 to 10 fold improvements in response time and throughput are not uncommon) but also in much more robust applications.

Every memory system in common use moves data between the levels in its hierarchy in blocks, not bits or bytes, and has done it this way for a long time and will for a long time more. They offer their services by taking bets on temporal, spatial, and pattern-based usage. That is all they can offer. Are you arguing that people should not understand these fundamental abstractions? To my mind that is mechanical sympathy. We have abstractions, we don't need to know intimate detail, but we need the appropriate level of detail. Without that appropriate level of detail, not only does performance suffer, code is a lot less robust. I've seen so many bugs in networking and storage code due to a lack of understanding of the basic abstractions.

Abstractions are at their best when small, composable, and fractal. I cringe when people talk about abstractions that are these huge monoliths that do not compose or have fractal characteristics, yes big frameworks I'm looking at you! :-)

Rüdiger Möller

unread,
Apr 18, 2014, 11:08:06 AM4/18/14
to mechanica...@googlegroups.com

On Friday, April 18, 2014 at 15:55:57 UTC+2, Robert Engels wrote:
Did you read my blog post that started this? A real world example of why this is just not the case... And I'm certain this is more common than people think.

 
I read it; however, beating some unknown C++ library with an unknown Java implementation tells me what?

Another great example is Linux itself. Look at the performances gains that have been made over the years. But its only been possible with massive numbers of developers and massive numbers of bugs.


An OS without mechanical sympathy would hardly have been successful, would it?
 
All most all of the big gains are from algorithmic changes which are often hard to get into the tree... Just to try and ensure correctness, and limit possible cross module affects, and so that the other developers can understand the scope of the changes.


I regularly speed up programs, written by people who believe in these popular hoaxes, by factors of 2 to 10. In many cases choosing the "best algorithm" is trivial, but still one implementation is 5 times faster than another. On the business side: performance still matters, as cloud cost scales pretty much linearly with app performance. Operational cost can be an issue if you need to operate a cluster of 5 servers to solve a problem which could be done on a single machine.
 
Then you have frameworks like "the disrupter" and when it does real work it is only 1% faster than the standard methods. And now they use Azul Zing anyway because writing gc-less java isn't java and so you my might as well write in C.


There is a difference between being "GC-less" and wasting memory like there is no tomorrow. Most developers use abstractions and frameworks without even knowing the cost. However, an architectural decision always needs to compare benefit and cost. Your "1%" above is the best example: when used in the correct place, the speedup of pipelining compared to naive queuing/pool executors is massive.

No offence, but I sometimes cannot understand why there are so many "performance myths" that are plain wrong, and why so many frameworks and libraries are hyped but in reality set you back like 10 years performance-wise (without any need to do so).

Of course performance might not be the most important thing for many apps; however, one should be able to quantify the performance cost if a decision is made to use a specific design pattern, framework or abstraction.

Gil Tene

unread,
Apr 18, 2014, 11:33:23 AM4/18/14
to <mechanical-sympathy@googlegroups.com>


Sent from my iPad

On Apr 18, 2014, at 6:20 AM, "robert engels" <ren...@ix.netcom.com> wrote:

On the non-cached read/writes, I think I am being a bit misunderstood. What I am proposing is to basically be able to a core as a non-SMP core. Obviously the locking and memory fencing needs to be more sophisticated, but I think you can see where I am headed.

What you would need for that is something like dedicated sets (dedicated to a thread, or memory space, or CPU) in the L3 cache. Since the L3 is physically indexed and physically tagged, the only practical dedication would be to a core (or to an L2). Some CPUs have variants of this capability, e.g. a way to "lock down" a set of ways in the cache so that it is not implicitly evicted and does not participate in LRU replacement. But since the number of ways in the cache is small (e.g. somewhere between 12 and 24 ways depending on model), and they are shared between many cores and threads, dedicating ways is a pretty brutal thing (your cache starts becoming less than 1-way associative, with interesting new thrashing behaviors) and needs careful consideration.

Also, one quick question in regards to your statement on network io. Wouldn't the driver/card do the io directly to memory bypassing the cache. And then wouldn't the driver move the memory directly to the mapped buffer space before reading, thereby reusing the addresses line and never affecting the cache. It would seem that just remapping the buffer and destroying the cache in the process would be too expensive overall ? Otherwise it would seem that any high volume network application is never using the L3 cache anyway...

It doesn't matter how the network traffic gets to/from memory from/to the NICs. If any core in your socket interacts with the data in the network traffic at any point, that data will be moving in and out of the L3, evicting other cold data as it goes. That's why, with an LRU cache, cold data only survives on idle sockets (ones that don't bring anything into their L3).

There are some interesting cache replacement policy tricks that try to address this to some degree. E.g. Ivy Bridge adds some cool optional config modes for cache replacement that are no longer LRU, with the goal of surviving scanning operations being one of the drivers. This would apply to network traffic as much as it would to any linear memory scan, or even to rare pointer walking (think of GC), but it could also end up accidentally early-evicting linearly moving hot stuff (like new object allocations in Java TLABs).

Some interesting reading to enhance your potential for mechanical sympathy with the cache replacement policy on current Xeons:


Robert Engels

unread,
Apr 18, 2014, 11:34:34 AM4/18/14
to mechanica...@googlegroups.com
Whoa. I am definitely not a troll. I came here seeking possible solutions for low-frequency low-latency applications running in the context of a generalized OS / enterprise application.

If I understand Gil correctly, if your application does any sort of serious network IO, and has a large working set, you're essentially screwed.

But other than his seemingly well-researched and accurate points, there has been just as much 'use C and get close to the metal, otherwise you'll be 20% slower' crap. And it's crap. It may be 20% in micro-benchmarks (or much more), but in real-world cases, it's crap.

I used to design video games in hand-coded assembly. They now use scripting languages and OpenGL.

If I sound slightly perturbed I am. I see a lot of "enhancements" coming into the Java language adding much complexity and it isn't going to matter in a real-world app. I would expect this forum's members to be rallying against it, but instead they seem to be the ones promoting it !

As a 30+ year engineer, I know marketing crap. Take the LMAX 'Disruptor'. It's crap. In real-world tests it's only marginally faster, with much greater complexity and constraints.

Similarly, there is a major 'network card provider' that touts their 'kernel bypass technology'. In real-world tests, it's actually 30% - 50% slower than competitors' "standard" cards running the latest kernel code on a modern processor.

So yeah, a lot of the talk in this forum (I've read many of the other recent topics in this forum) seems to be more masturbation than anything. It makes the writer feel good, but isn't going to matter much in the long run.

Show me techniques that work in real applications, and I'm on board.



-----Original Message-----
From: Martin Thompson
Sent: Apr 18, 2014 9:23 AM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer

So is your point that the efforts of this group are futile? Is your quest to prove us all wrong and that we should not care for what is under all the abstractions? Without a really compelling case this sort of approach will come across to others as trolling whatever the underlying motivation.

The reason Mechanical Sympathy started in motor racing was because many drivers had reached the point of blindly accepting abstractions and this resulted in not only reduced performance but also greatly increased risk of harm.

We are all a product of our own experience. In my experience showing mechanical sympathy can not just result in enormous performance gains, 3 to 10 fold improvements in response time and throughput is not uncommon, but also much more robust applications.

Every memory system in common use moves data between the levels in its hierarchy in blocks, not bits or bytes, and has done it this way for a long time and will for a long time more. They offer their services by taking bets on temporal, spacial, and patterned based usage. That is all they can offer. Are you arguing that people should not understand these fundamental abstractions? To my mind that is mechanical sympathy. We have abstractions, we don't need to know intimate detail, but we need the appropriate level of detail. Without that appropriate level of detail, not only does performance suffer, code is a lot less robust. I've seen so many bugs in networking and storage code due to a lack of understanding of the basic abstractions.

Abstractions are at their best when small, composable, and fractal. I cringe when people talk about abstractions that are these huge monoliths that do not compose or have fractal characteristics, yes big frameworks I'm looking at you! :-)

On Friday, 18 April 2014 14:55:57 UTC+1, Robert Engels wrote:
Did you read my blog post that started this? A real world example of why this is just not the case... And I'm certain this is more common than people think.

Another great example is Linux itself. Look at the performance gains that have been made over the years. But it's only been possible with massive numbers of developers and massive numbers of bugs.

Almost all of the big gains are from algorithmic changes, which are often hard to get into the tree... just to try and ensure correctness, limit possible cross-module effects, and make sure the other developers can understand the scope of the changes.

Then you have frameworks like "the disrupter", and when it does real work it is only 1% faster than the standard methods. And now they use Azul Zing anyway, because writing GC-less Java isn't Java, and so you might as well write in C.



On April 18, 2014 8:08:32 AM CDT, "Rüdiger Möller" <moru...@gmail.com> wrote:
Am Donnerstag, 17. April 2014 20:27:48 UTC+2 schrieb Robert Engels:
Agreed. I also think that the JVM can provide real gains here with dynamic memory analysis, rather than static compiler optimizations (and manual laying out of shared structures / cache lines, etc. by the developer).

People knock the higher-level abstractions in Java, and continually want lower, more direct access to the hardware, because people think they are smarter than the machine. People will eventually figure out that using higher-level abstractions will lead to better performance when they let go of their egos.

 
Depends on the market you are in. In a competitive environment, your "higher level abstractions" Java app will always be behind by something like 20% (at best) compared to a manually optimized "to-the-metal" application. Actually, hardware innovation cycles are not that fast. Frequently only a small percentage of your overall codebase is tweaked to perform on current hardware, so you might overestimate the cost of mechanical sympathy.


Robert Engels

unread,
Apr 18, 2014, 11:42:08 AM4/18/14
to mechanica...@googlegroups.com
Also, one more case to make my point. There is a leading vendor selling essentially a market data processing appliance running on standardized hardware - with GREAT performance metrics. But then you realize it doesn't support dynamic instrument addition (due to the pre-allocated array lengths, and partitioning), so for a large class of applications it is worthless.

Robert Engels

unread,
Apr 18, 2014, 11:45:38 AM4/18/14
to <mechanical-sympathy@googlegroups.com>
I am slightly confused by this:

It doesn't matter how the network traffic gets to/from memory from/to the NICs. If any core in your socket interacts with the data in the network traffic at any point, that data would be moving in and out of the L3, evicting other cold data as it goes. That's why with an LRU cache, cold data only survives on idle sockets (ones that don't bring anything into their L3).

Imagine the case where you only had a single 1k buffer, and ALL network traffic went through it (if you didn't process a packet fast enough it was dropped). How would this destroy the cache? Isn't the buffer going to be at a fixed location in main memory and retain the same physical address mapped in the cache, so that at most reading/writing to this buffer could only destroy 1k of the cache?


-----Original Message-----
From: Gil Tene
Sent: Apr 18, 2014 10:33 AM
To: ""
Subject: Re: the real latency performance killer



Sent from my iPad

Follow up answers inline.
>To: "mechanica...@googlegroups.com" <mechanica...@googlegroups.com>
>Subject: Re: the real latency performance killer
>
>
>On Apr 17, 2014, at 8:23 PM, Robert Engels <ren...@ix.netcom.com> wrote:
>
>> It's mainly for Java, but can apply to large C applications as well with certain allocators. By splitting the large process into multiple smaller ones, then isolating them, heap access by some tasks can't affect the cache of the other, smaller "important" tasks.
>
>I don't see how this will help. Threads within a process or in different processes will have the same overall effect on the cores and associated cache. Threads and processes are abstractions, or ways of organizing things for humans. By the time they hit a core....
>
>Regards,
>Kirk
>




Kirk Pepperdine

unread,
Apr 18, 2014, 11:53:02 AM4/18/14
to mechanica...@googlegroups.com


As a 30+ year engineer, I know marketing crap. Take the LMAX 'disrupter'. It's crap. In real-world tests it's only marginally faster, with much greater complexity and constraints.

Whoa there, Hoss. Calling the disrupter crap is a very hostile way to conduct what should be a technical conversation. If the disrupter doesn't happen to fit your needs, that's one thing... but to call it crap? I'm sorry, but you need to calm down, my man, it's only tech…

Peace,
Kirk

Robert Engels

unread,
Apr 18, 2014, 11:56:12 AM4/18/14
to mechanica...@googlegroups.com
The library is not unknown. It is widely used, actively maintained, and the source code is available. The Java library is proprietary at this time, sorry, so you would just have to take my word for it.

As to your comments on the disrupter - I don't follow. Just have a worker send a 1k UDP packet (even a statically allocated one), in response to every event (using the included sample benchmarks). You will see that the overhead of message passing (which is what the disrupter is an attempt to improve) becomes negligible, and the standard Java version is within 1% of the 'disrupter' version.
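Roughly the shape of handler I mean (a sketch only; it assumes the open-source Disruptor's EventHandler interface, an illustrative long[] event type, and an arbitrary destination, and is not the exact test code I'll post):

import com.lmax.disruptor.EventHandler;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Sketch: do "real work" (a 1k UDP send) per event so the hand-off overhead
// becomes a small fraction of the total cost. The event type and target are
// illustrative assumptions, not part of the Disruptor samples.
public final class UdpSendingHandler implements EventHandler<long[]> {
    private final DatagramChannel channel;
    private final InetSocketAddress target;
    private final ByteBuffer payload = ByteBuffer.allocateDirect(1024); // statically allocated 1k buffer

    public UdpSendingHandler(String host, int port) throws IOException {
        this.channel = DatagramChannel.open();
        this.target = new InetSocketAddress(host, port);
    }

    @Override
    public void onEvent(long[] event, long sequence, boolean endOfBatch) throws Exception {
        payload.clear();
        payload.putLong(0, event[0]);   // reuse the same buffer, no per-event allocation
        channel.send(payload, target);  // the send dominates the per-event cost
    }
}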

-----Original Message-----
From: Rüdiger Möller
Sent: Apr 18, 2014 10:08 AM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer

Am Freitag, 18. April 2014 15:55:57 UTC+2 schrieb Robert Engels:
Did you read my blog post that started this? A real world example of why this is just not the case... And I'm certain this is more common than people think.

 
I read it; however, beating some unknown C++ library with an unknown Java implementation tells me what?

Another great example is Linux itself. Look at the performances gains that have been made over the years. But its only been possible with massive numbers of developers and massive numbers of bugs.


An OS without mechanical sympathy would hardly have been successful, would it?
 
All most all of the big gains are from algorithmic changes which are often hard to get into the tree... Just to try and ensure correctness, and limit possible cross module affects, and so that the other developers can understand the scope of the changes.


I regularly speed up programs by factors of 2 to 10 that were written by people who believe in these popular hoaxes. In many cases choosing the "best algorithm" is trivial, yet one implementation is still 5 times faster than another. On the business side: performance still matters, as cloud cost scales pretty much linearly with app performance. Operational cost can be an issue if you need to operate a cluster of 5 servers to solve a problem which could be done on a single machine.
 
Then you have frameworks like "the disrupter" and when it does real work it is only 1% faster than the standard methods. And now they use Azul Zing anyway because writing gc-less java isn't java and so you my might as well write in C.


There is a difference between being "GC-less" and wasting memory like there is no tomorrow. Most developers use abstractions and frameworks without even knowing the cost. However, an architectural decision always needs to compare benefit and cost. Your "1%" above is the best example: when used in the correct place, the speedup of pipelining compared to naive queuing/pool executors is massive.

No offence, but I sometimes cannot understand why there are so many "performance myths" which are plainly wrong, and why so many frameworks and libraries are hyped when in reality they beam you back 10 years performance-wise (without needing to).

Of course performance might not be the most important thing for many apps; however, one should be able to quantify the performance cost if a decision is made to use a specific design pattern, framework or abstraction.

Robert Engels

unread,
Apr 18, 2014, 11:58:57 AM4/18/14
to mechanica...@googlegroups.com

And your cloud comment is nonsense. It's a prime reason for the success of 'Ruby On Rails'. You know how they optimize? Add another server... (because it's dog-shit slow), and it is currently the most popular cloud-based infrastructure component.

Gil Tene

unread,
Apr 18, 2014, 12:04:54 PM4/18/14
to <mechanical-sympathy@googlegroups.com>


Sent from my iPad

On Apr 18, 2014, at 8:45 AM, "Robert Engels" <ren...@ix.netcom.com> wrote:

I am slightly confused by this:

It doesn't matter how the network traffic gets to/from memory from/to the NICs. If any core in your socket interacts with the data in the network traffic at any point, that data would be moving in and out of the L3, evicting other cold data as it goes. That's why with an LRU cache, cold data only survives on idle sockets (ones that don't bring anything into their L3).

Imagine the case where you only had a single 1k buffer, and ALL network traffic went through it (if you didn't process a packet fast enough it was dropped). How would this destroy the cache? Isn't the buffer going to be at a fixed location in main memory and retain the same physical address mapped in the cache, so that at most reading/writing to this buffer could only destroy 1k of the cache?

The sum of network buffers in the kernel is usually much larger than the L3 cache. And even the set of in-flight, not-yet-consumed and recyclable ones tends to be pretty big. Most high-performance stacks end up using some sort of streaming circular buffer or chained ring buffer scheme, and virtually all network drivers use ring buffer chains to communicate with NICs. The size of that ring data is often larger than the L3.
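A back-of-envelope sketch of that arithmetic (the ring sizes below are assumed typical values, not measurements from any particular NIC or driver):

// Assumed-typical values to illustrate the point, not measured figures.
public final class RingFootprint {
    public static void main(String[] args) {
        int descriptorsPerRing = 4096;   // assumed RX descriptors per ring
        int bytesPerBuffer     = 2048;   // assumed per-descriptor receive buffer
        int rings              = 8;      // assumed one ring per RSS queue
        long bytes = (long) descriptorsPerRing * bytesPerBuffer * rings;
        System.out.printf("~%d MB of in-flight receive buffers vs ~20 MB of shared L3%n",
                bytes / (1024 * 1024));  // prints ~64 MB with these assumptions
    }
}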

Gil Tene

unread,
Apr 18, 2014, 12:04:57 PM4/18/14
to <mechanical-sympathy@googlegroups.com>
Robert,

You are clearly not a real troll, and your questions and technical material seem to have much non-troll thinking behind them. But calling other people's hard work marketing crap is not a good way to demonstrate non-trollness... There are many people with different opinions here, and many of them can be right at the same time, even when they disagree. It's when mud gets slung around that things get messy.

From experience, I can say that both the LMAX disruptor and kernel bypass stacks work extremely well in very real applications. And they both beat the pants off of classic alternatives when set up well. But both are examples of working really well when set up right, and being seemingly disappointing when people don't understand how to set them up. A great example of where mechanical sympathy is important.

E.g. in both the disruptor and kernel bypass cases, dedicating an always-spinning thread makes a world of difference to real-world behavior. And pinning that thread to a core takes away all sorts of problems. But when people try to use them in non-spinning modes, or on oversubscribed systems that end up taking a whole scheduler quantum away from a thread that depends on constant spinning for good behavior, they often draw wrong conclusions about the behavior of a correctly set up system.



Sent from my iPad

Robert Engels

unread,
Apr 18, 2014, 12:18:28 PM4/18/14
to <mechanical-sympathy@googlegroups.com>
Calling someone a troll is often used by some people to ward off valid criticism. I only responded in kind. I am sorry that I let it get the best of me.

I stand by my claims on both the disrupter and the kernel bypass, at least when using them in ways considered valid and appropriate according to their marketing, in real-world applications, and even in isolated tests using their provided performance test cases.



-----Original Message-----
From: Gil Tene

Rüdiger Möller

unread,
Apr 18, 2014, 1:18:33 PM4/18/14
to mechanica...@googlegroups.com, Robert Engels
You should add some self-doubt to your judgement as you probably miss the point :-)

Increase your send rate to at least 200,000 msg/second, then decode and process/dispatch those messages. You'll fail to do this single-threaded, so you have to go parallel while still maintaining serial order of processing. Try to do this without pipelining.
Though we use our own flavour of pipelining (for lack of understanding of the problem domain upfront), I found the disruptor to be a very clean and well-performing implementation of that pattern.

Rüdiger Möller

unread,
Apr 18, 2014, 1:26:23 PM4/18/14
to mechanica...@googlegroups.com, Robert Engels
Am Freitag, 18. April 2014 17:58:57 UTC+2 schrieb Robert Engels:
And your cloud comment is nonsense. It's a prime reason for the success of 'Ruby On Rails'. You know how they optimize? Add another server... (because it's dog-shit slow), and it is currently the most popular cloud-based infrastructure component

So they trade development cost against operational cost. This has broken the necks of startups, especially if you need to serve 10k free users per paying customer (low conversion rate).
But agreed, performance is not the major issue for many systems. For exchange systems it is an issue.

Robert Engels

unread,
Apr 18, 2014, 1:45:35 PM4/18/14
to mechanica...@googlegroups.com
I think you'll find that for all but the largest of users, the infrastructure cost of adding another server (or servers) to double performance is far cheaper than hiring a developer (or developers) to do the same...

But as you said, the exchange space needs to be orders of magnitude faster than 'e-commerce', and the serial processing required for order book management puts constraints on the ability to just throw hardware at it. But if you look at what the CME has done in their latest infrastructure roll-out, that is exactly what they've done... (by partitioning by product, and using some custom NICs in the order gateways).

Again, bringing this back to where I started, the solution being offered seems to tend towards socket isolation, which gets very expensive, very fast, when talking about co-located servers.



-----Original Message-----
From: Rüdiger Möller
Sent: Apr 18, 2014 12:26 PM
To: mechanica...@googlegroups.com
Cc: Robert Engels
Subject: Re: the real latency performance killer


Rüdiger Möller

unread,
Apr 18, 2014, 3:22:27 PM4/18/14
to mechanica...@googlegroups.com, Robert Engels
Am Freitag, 18. April 2014 19:45:35 UTC+2 schrieb Robert Engels:
I think you'll find that for all but the largest of users, the infrastructure costs of adding another server(s) to double performance is far cheaper than hiring a developer(s) to do the same...

Sometimes building a highly optimized abstraction layer relaxes the requirements for business logic and therefore reduces overall development cost, as implementation/change of business logic becomes straightforward and can be done cheaply without deep knowledge of the overall system.

Also, development cost is mostly a one-time effort. Operational cost is permanent. Even in e-commerce/internet apps, not everything scales horizontally...
The internet app market is still in the "innovation beats implementation" phase. I'd predict conversion rates will get lower and lower over time, to the point where operational cost and software efficiency represent a significant survival factor.

Robert Engels

unread,
Apr 18, 2014, 3:22:30 PM4/18/14
to mechanica...@googlegroups.com
I've been giving this some more thought, and given that the network buffer size is certainly greater than the L3 cache, wouldn't it be better for the NIC to write directly to isolated main memory (bypassing the cache) when queuing the incoming packet, and then have the kernel perform a memcpy to a constant "processing buffer" in the cache, maybe one per open socket, so that only that buffer affects the cache? Otherwise it would seem that on even decently fast networks (even 1 Gb), the network traffic alone makes the L3 cache useless.

It's just a thought... It just seems strange that you can conceivably send a packet (rare, low frequency) across a high-speed network to an idle machine (with an intact cache), which then acts on it, and get far better performance than trying to do the work on the machine receiving the actual data. Seems ripe for some sort of better partitioning scheme (although as you stated, you would probably get the best performance by just processing the request on an isolated socket).

Martin Thompson

unread,
Apr 18, 2014, 4:03:51 PM4/18/14
to mechanica...@googlegroups.com
On 18 April 2014 20:22, Robert Engels <ren...@ix.netcom.com> wrote:
I've been giving this some more thought, and given that the network buffer size is certainly greater than the L3 cache, wouldn't it be better for the NIC to write directly to isolated main memory (bypassing the cache) when queuing the incoming packet, and then have the kernel perform a memcpy to a constant "processing buffer" in the cache, maybe one per open socket, so that only that buffer affects the cache? Otherwise it would seem that on even decently fast networks (even 1 Gb), the network traffic alone makes the L3 cache useless.

It's just a thought... It just seems strange that you can conceivably send a packet (rare, low frequency) across a high-speed network to an idle machine (with an intact cache), which then acts on it, and get far better performance than trying to do the work on the machine receiving the actual data. Seems ripe for some sort of better partitioning scheme (although as you stated, you would probably get the best performance by just processing the request on an isolated socket).

I think it does not matter so much what we would like to happen, we have to face what the current situation is.

On Intel servers, since Sandy Bridge, the PCI-e controller is connected to the same on-chip ring bus as the L3 cache segments and the memory controller. DDIO[1] was introduced so that incoming network traffic is addressed directly to the L3 cache. The main reason for this is power saving. Most hardware innovation is now focused on power saving. When data lands in an L3 cache segment, data already in the L1/L2 caches can be evicted if it clashes on ways.

Your goal of low-latency and low-frequency is achievable with some mechanical sympathy.

Say you have a 2-socket server. Set up isolcpus to isolate the second socket. Then all traffic and OS needs are serviced by the first socket. Memory can also be bound by socket. The second socket can run your app and receive the incoming traffic over IPC, using the QPI as a very fast on-server network.

Those kernel bypass network stacks that you so dislike can run on the first socket, saving you a lot of latency. I have first-hand measurements of them halving latency between machines. Then an IPC mechanism using spinning techniques like the Disruptor can be employed to communicate between the sockets at under 100ns latency, with the data in ring buffers that hide the latency through prefetching and an understanding of cache lines.
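To show the shape of the spinning hand-off I mean (an illustrative on-heap toy, not the Disruptor itself and not real cross-process IPC; a cross-socket version would place the ring in shared memory and pin each thread to a core on its own socket):

import java.util.concurrent.atomic.AtomicLong;

// Minimal single-producer/single-consumer spinning ring sketch. It relies on the
// ordered-store plus volatile-read pattern used by Disruptor-style queues.
public final class SpinningSpscRing {
    private final long[] buffer;
    private final int mask;
    private final AtomicLong head = new AtomicLong(0); // next slot to write (producer only)
    private final AtomicLong tail = new AtomicLong(0); // next slot to read (consumer only)

    public SpinningSpscRing(int sizePowerOfTwo) {      // caller must pass a power of two
        buffer = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    public void publish(long value) {
        long seq = head.get();
        while (seq - tail.get() >= buffer.length) { /* spin: ring full */ }
        buffer[(int) (seq & mask)] = value;
        head.lazySet(seq + 1);   // ordered store makes the slot visible to the consumer
    }

    public long take() {
        long seq = tail.get();
        while (head.get() == seq) { /* spin: ring empty */ }
        long value = buffer[(int) (seq & mask)];
        tail.lazySet(seq + 1);
        return value;
    }
}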

If you follow such a design you can achieve your goals and be MUCH faster than the "standard" approaches. This is mechanical sympathy in action and not technical masturbation.

Robert Engels

unread,
Apr 18, 2014, 4:22:00 PM4/18/14
to Rüdiger Möller, mechanica...@googlegroups.com, Robert Engels
I mostly agree, but I know consultants that have been on the same project for 10+ years... let alone the developers that were hired full-time.

I think design simplicity solves most of the ills (somewhere people forgot the adage that premature optimization is the root of all evil) and allows developers to be productive in reacting to market demand, while simultaneously allowing for easy horizontal and/or vertical scalability.

-----Original Message-----
From: Rüdiger Möller
Sent: Apr 18, 2014 2:22 PM
To: mechanica...@googlegroups.com
Cc: Robert Engels
Subject: Re: the real latency performance killer

Robert Engels

unread,
Apr 18, 2014, 4:35:00 PM4/18/14
to mechanica...@googlegroups.com
This is almost exactly what we do now; the problem arises when the single socket does not have enough processing power to handle all of the high-frequency traffic. You can't move some of it to the other socket, because then you destroy the cache for the low-frequency process.

And a quick point on the kernel bypass... I made my assessment based on real-world, live performance tests under identical conditions. Now, could I possibly have reconfigured the CPU isolation, CPU assignments, and interrupt assignments to maybe make it perform better? Possibly. Then again, when you have hundreds of customers, with different, constantly changing workloads, and different hardware and OS configurations, at some point you need solutions that work "mostly out of the box"; otherwise you will eat up profits chasing dragons (especially when so many of the factors contributing to latency are out of your control).

-----Original Message-----
From: Martin Thompson
Sent: Apr 18, 2014 3:03 PM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer


Martin Thompson

unread,
Apr 18, 2014, 5:07:16 PM4/18/14
to mechanica...@googlegroups.com
I cannot comment on your setup or the coding capabilities of your developers. What I can say is that you can buy some very powerful dual-socket servers these days, and that is only getting better. Intel now has 15-core Ivy Bridge chips, so two of those is a lot of CPU horsepower.

There is also a lot to be said for having mechanical sympathy and writing good single-threaded code. Amazing things can be achieved on a single thread when you understand the basics of memory subsystems. I've never yet met a feed I cannot ingest with good coding practices and some reasonable hardware.

No one ever said developing quality software was easy. I know so few people that really excel at it. Part of that is how we train people and set expectations. All the great developers I've met had great mechanical sympathy and wrote code that was very readable, the complete opposite of what people *believe* high-performance code looks like. If you analyse the complexity in any major software stack, it could be argued that software is one of the most complex endeavours we undertake. The challenge is working out where we need to focus, as all our capabilities are limited.






Nitsan Wakart

unread,
Apr 18, 2014, 5:09:05 PM4/18/14
to mechanica...@googlegroups.com

> People will eventually figure out that using higher level abstractions will lead to better performance when they let go of their egos.
I fail to see how meditating on how the JVM/hardware/OS is going to solve all my problems in just a few years will help in solving a performance issue today. Preaching to people to let go of their ego is all very Zen, but this sounds more like a recipe for letting go of your job ;-)

Robert Engels

unread,
Apr 18, 2014, 5:41:02 PM4/18/14
to mechanica...@googlegroups.com
I think I might not be presenting that correctly... If you focus on highly readable, easily maintainable code, it is easy to fix almost all performance problems. You'd be amazed at the number of people who write O(n²) algos that are buried. Write clean code and a competent developer can easily find and fix these mistakes.
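A typical example of the kind of buried O(n²) I mean, and how clean code makes the fix trivial (illustrative only):

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Both versions are equally readable, which is the point: in clean code the
// quadratic contains() scan is easy to spot and the fix is a one-liner.
public final class Dedup {
    // O(n^2): contains() walks the output list on every iteration.
    static List<String> dedupSlow(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String s : input) {
            if (!out.contains(s)) {
                out.add(s);
            }
        }
        return out;
    }

    // O(n): a set membership check replaces the linear scan.
    static List<String> dedupFast(List<String> input) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String s : input) {
            if (seen.add(s)) {
                out.add(s);
            }
        }
        return out;
    }
}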

Take Java itself. I can write a program today and run it on Java 1 and, without changing a line of code or the hardware, execute it on Java 8 10-100 times faster. This is the benefit of abstractions.

Nitsan Wakart

unread,
Apr 18, 2014, 6:29:05 PM4/18/14
to mechanica...@googlegroups.com

"Take java itself. I can write a program today and run it on Java 1 and without changing a line of code, or the hardware, execute it on Java 8 10-100 times faster. This is the benefit on abstractions."
This is the benefit of hindsight, not abstractions. If I was employing you to meet a requirement with Java 1, I would probably not want to sit on that beautifully abstracted code for 20 years and wait for Java 8 to meet the performance requirement.
A problem that needs solving today is not helped by the miracles of the future, not even if you predict them. 

robert engels

unread,
Apr 18, 2014, 6:41:46 PM4/18/14
to mechanica...@googlegroups.com
It was clearly an exaggeration, but I have experienced multiple times in my career, when dealing with large-scale enterprise applications that take years to develop and deploy (with business requirements changing all the time), that Moore's law holds, and that by the time you are releasing, many times what was a performance bottleneck is no more - and by focusing on the flexibility of the design first, you aren't stuck with an outdated application by the time it is released.

robert engels

unread,
Apr 18, 2014, 6:52:46 PM4/18/14
to mechanica...@googlegroups.com
Also, Twitter was originally written in RoR... It is far better to be adaptable and performant enough than super performant and not highly adaptable...

There is only a very small class of problems that will ever need this level of attention at the application layer (HPC the most important; HFT is going away)... Outside of these classes, most problems are caused by poor algorithm/data structure choices, or IO limitations.

On Apr 18, 2014, at 5:29 PM, Nitsan Wakart wrote:

Nitsan Wakart

unread,
Apr 18, 2014, 7:26:30 PM4/18/14
to mechanica...@googlegroups.com
"It is far better to be adaptable and performant enough, than super performant and not highly adaptable..."
Enough is the key word... so this is a dressed up "premature == evil" argument.
Let's not go there again.
Perhaps we need a separate "we think people worry too much about performance" mailing list.

robert engels

unread,
Apr 18, 2014, 7:29:00 PM4/18/14
to mechanica...@googlegroups.com
Yea, that Knuth guy didn't really know what he was talking about...

Jason Koch

unread,
Apr 18, 2014, 8:03:36 PM4/18/14
to mechanica...@googlegroups.com


On 19 Apr 2014, at 8:41 am, robert engels <ren...@ix.netcom.com> wrote:

It was clearly an exaggeration, but I have experienced multiple times in my career, when dealing with large-scale enterprise applications that take years to develop and deploy (with business requirements changing all the time), that Moore's law holds, and that by the time you are releasing, many times what was a performance bottleneck is no more - and by focusing on the flexibility of the design first, you aren't stuck with an outdated application by the time it is released.


Moore's law doesn't help you replace existing hardware without capital expense, lead time, outages etc on a production system. Waiting for faster hardware doesn't help if your hardware is already deployed.

I have experienced many times a business constrained by the performance of its systems and unable to add hardware quickly enough to release new product, sell more, or add features expected by the market, or, worse, to deal with new regulatory requirements.

Waiting for faster or even just more hardware is not an option in many real environments and paying attention to performance can have significant real benefits to business even outside the trading world.

robert engels

unread,
Apr 18, 2014, 8:29:19 PM4/18/14
to mechanica...@googlegroups.com
If this is true:

I have experienced many times a business constrained by the performance of its systems and unable to add hardware quickly enough to release new product, sell more, or add features expected by the market, or, worse, to deal with new regulatory requirements.

you have an underlying design/architecture problem, not a performance problem.

In any complex enterprise system, the techniques being discussed in this forum will not apply (or will be a minuscule performance improvement) - there is just too much cross-talk, too many huge data sets, tons of IO, a complex app with changing requirements, multiple teams/products, etc., for you to control the system at the level needed for them to apply.

They apply well in micro-benchmarks, and highly specialized HPC code. This is a very, very small portion of the application space, especially when discussing business/commerce apps (which is where it seems most of the work is being done).

All that being said, it is always better to write more efficient code, but EXTREMELY RARELY at the expense of maintainability.

We have people from TOP schools applying at our company who can't write a simple, working sort of integers (even a bubble sort) during an interview.

So again I'll say, it might make a lot of people feel better, and think it provides them lots of job security, if they can write the utmost nuanced code correctly and eke out some performance gains, but the rest of the world will be using Ruby on Rails, writing dog-shit-slow code, getting it to market, and making billions...




Nitsan Wakart

unread,
Apr 18, 2014, 8:29:57 PM4/18/14
to mechanica...@googlegroups.com
He did, and he said it, and we all know it... why re-hash the obvious? You need to prove the 'premature' bit for the argument to catch.

robert engels

unread,
Apr 18, 2014, 8:45:15 PM4/18/14
to mechanica...@googlegroups.com
You know that depends on the use case...

If I have an operation that takes 1 sec to complete and it's done 3 times a day, and it's got really clean well designed code, and is easy to maintain, am I going to hack it to make it work 20% faster??? I would hope not.

Take that same app, and it has a job that takes 10 hours to complete, and it runs overnight (or in the background). Am I going to hack it to pieces to make it run in 8 hrs? Unless it is a COMPLETELY O(N²) shit design, you aren't going to make the 10 hr job complete in 10 seconds, no matter what coding changes you make...

The worst is to design for the 10 seconds at the start and realize that no matter what you do it's going to take 8 hrs, now you have the complex code to maintain, and good luck scaling out, etc.

That is why I make reference to the 'Disrupter' as it's really easy to show that it doesn't matter... The test cases are well written, and it includes comparison tests with "more standard" implementations. Just make the worker do something extremely simple like send a multicast udp packet. This operation is orders of magnitude slower than the message/queue processing overhead, so these "performance improvements" quickly go out the window...

Rüdiger Möller

unread,
Apr 18, 2014, 9:37:14 PM4/18/14
to


Am Samstag, 19. April 2014 02:45:15 UTC+2 schrieb Robert Engels:
Take that same app, and it has a job that takes 10 hours to complete, and it runs overnight (or in the background). Am I going to hack it to pieces to make it run in 8 hrs? Unless it is a COMPLETELY O(N²) shit design, you aren't going to make the 10 hr job complete in 10 seconds, no matter what coding changes you make...


The batch might crash after 9 hours 50 minutes and you can't open the market because your data is dirty (it has happened).
 
I'd say "The root of all evil is hindsight optimization". I have seen this more often: half-baked caches are planted all over the application and the whole architecture is hastily worked around within 2 days in order to meet requirements.

Rüdiger Möller

unread,
Apr 18, 2014, 9:12:56 PM4/18/14
to mechanica...@googlegroups.com

Am Samstag, 19. April 2014 00:41:46 UTC+2 schrieb Robert Engels:
It was clearly an exaggeration, but I have experienced multiple times in my career, when dealing with large-scale enterprise applications that take years to develop and deploy (with business requirements changing all the time), that Moore's law holds, and that by the time you are releasing, many times what was a performance bottleneck is no more - and by focusing on the flexibility of the design first, you aren't stuck with an outdated application by the time it is released.

Really? From my experience, the existence of a "performance bottleneck" drops into a project's mind like 4 weeks before production. Same applies to the idea that there might be more than 10 concurrent users :-)



 

robert engels

unread,
Apr 18, 2014, 9:36:56 PM4/18/14
to mechanica...@googlegroups.com
The emphasis didn't come out quite right (too much on Moore's law), as the design flexibility (along with proven patterns for common problems) is just as important. You also need to architect a system in a macro sense for performance from the start; that way you know ahead of time where your escape hatches are, and why you have more than one, so that as requirements shift, you still have options.


Jason Koch

unread,
Apr 18, 2014, 10:15:07 PM4/18/14
to mechanica...@googlegroups.com

>
> If this is true:
>
>> I have experienced many times a business constrained by the performance of its systems and unable to quickly enough add hardware to release new product, sell more, or add features expected by the market, or worse to deal with new regulatory requirements.
>
> you have an underlying design/architecture problem, not a performance problem.
>
> In any complex enterprise system, the techniques being discussed in this forum will not apply (or be a minuscule performance improvement) - there is just too much cross talk, huge data sets, tons of IO, in a complex app with changing requirements, and multiple teams/products etc. that you cannot control the system at the level needed for them to apply.
>

Batching, caching, queuing, latency, synchronisation, throughput, benchmarking technique, single writer, etc. are actually all architecture and design approaches/considerations which are then implemented at multiple layers in a system. Whether it is inter-thread messaging or B2B communications is just a matter of scale and level of abstraction.


robert engels

unread,
Apr 18, 2014, 10:28:35 PM4/18/14
to mechanica...@googlegroups.com
Agreed, but very little of that "goes to the metal" and if you screw up the macro design it won't matter what you do at the metal...

Sent from my iPad

Nitsan Wakart

unread,
Apr 19, 2014, 3:28:08 AM4/19/14
to mechanica...@googlegroups.com
To be blunt, let me highlight the noise-to-signal ratio in your last comment:

On Apr 19, 2014, at 2:45 AM, robert engels <ren...@ix.netcom.com> wrote:

You know that depends on the use case...

If I have an operation that takes 1 sec to complete and it's done 3 times a day, and it's got really clean well designed code, and is easy to maintain, am I going to hack it to make it work 20% faster??? I would hope not.

Take that same app, and it has a job that takes 10 hours to complete, and it runs overnight (or in the background). Am I going to hack it to pieces to make it run in 8 hrs? Unless it is a COMPLETELY O(N²) shit design, you aren't going to make the 10 hr job complete in 10 seconds, no matter what coding changes you make...

The worst is to design for the 10 seconds at the start and realize that no matter what you do it's going to take 8 hrs, now you have the complex code to maintain, and good luck scaling out, etc.

Blah blah blah- premature == evil

That is why I make reference to the 'Disrupter' as it's really easy to show that it doesn't matter... The test cases are well written, and it includes comparison tests with "more standard" implementations. Just make the worker do something extremely simple like send a multicast udp packet. This operation is orders of magnitude slower than the message/queue processing overhead, so these "performance improvements" quickly go out the window...
You show little understanding of, and no respect for, others with that latest remark. I know the Disruptor was developed to answer a need; they tried the JDK first, on the real system not a benchmark, and it failed to meet their needs. Many people on this list, and elsewhere, use the Disruptor or similar approaches, yet you seem to think they are all fashion victims. I have personally worked on or witnessed enough systems that see the benefit, and you offer me no PROOF to the contrary...

Robert Engels

unread,
Apr 19, 2014, 9:07:22 AM4/19/14
to mechanica...@googlegroups.com
I will post the more realistic test cases next week when I'm in the office.

I will also point to the fact that LMAX needed to move to Azul Zing anyway, because the approach of no-garbage Java just doesn't work for all but the most trivial systems.

Gil Tene

unread,
Apr 19, 2014, 10:37:04 AM4/19/14
to <mechanical-sympathy@googlegroups.com>


Sent from my iPad

On Apr 19, 2014, at 6:07 AM, "Robert Engels" <ren...@ix.netcom.com> wrote:

I will post the more realistic test cases next week when I'm in the office.

I will also point to the fact that LMAX needed to move to Azul Zing anyway, because the approach of no-garbage Java just doesn't work for all but the most trivial systems.

The disruptor and Zing address two very different things.

Zing's main value to responsiveness is in making the JVM continually reactive (JVM noise levels fall to below the OS noise levels). It does so primarily by making GC work right. As in "work without disrupting your application". The way we usually compare Zing with other alternatives is by looking at the behavior of latency percentiles. We expect little difference in the common case behavior, and huge differences in "outliers".

The Disruptor provides a low-latency, high-throughput execution pattern whose very real benefit can be measured. When comparing the disruptor with alternatives, I usually look at two things: common case latency, and achievable throughput levels at acceptable latency. My experience is that the disruptor tends to do much better on both of these metrics (i.e. lower common case latency, and higher achievable throughput) when compared to using stock queues, or other forms of multiplexed I/O or executor threads. I've certainly seen things that do even better than the disruptor, but thus far they have been hand-crafted.

The disruptor has nothing to do with no-garbage Java or GC pauses, and Zing has nothing to do with how fast your event or message processing code paths are. The two are complementary. That's why we see many people who use Zing also use a disruptor.

BTW, an obvious side effect of solving the GC problem at the JVM level is that outliers are no longer dependent on whether or not you allocate stuff, or on how much stuff you keep in the heap. So you can use idiomatic Java, or no-GC approach variants, or anything in between: with Zing, outlier behavior will be the same for all. You can also keep things in memory that can help you go fast. However, [common case] speed is still determined by what your code does, and allocating objects for no good reason, or wasting time executing inefficient code paths, is still going to affect your common case latencies.

I do find that [with the outlier problem solved] re-introducing some normal Java allocation to low-latency code paths sometimes helps improve their speed, but this is usually when the alternative was allocating stuff in some other way [e.g. object pools]. Using normal allocation patterns and regular objects also makes things simpler, more productive, and easier to debug [e.g. compared to object pools, other malloc/free variants, arenas, or off-heap "objects"], especially because you can use regular Java code (including other people's code), and don't have to deal with "free".


Robert Engels

unread,
Apr 19, 2014, 10:46:39 AM4/19/14
to mechanica...@googlegroups.com
I would disagree about the no-GC focus of the disrupter, based on their own blog posts. Of course needless garbage generation is never appropriate, but I will argue that attempting no garbage at all is even worse...

Martin Thompson

unread,
Apr 19, 2014, 11:12:50 AM4/19/14
to mechanica...@googlegroups.com
On 19 April 2014 14:07, Robert Engels <ren...@ix.netcom.com> wrote:
I will post the more realistic test cases next week when I'm in the office.

I will also point to the fact that LMAX needed to move to Azul Zing anyway, because the approach of no-garbage Java just doesn't work for all but the most trivial systems.

Have you worked at LMAX? Do you have any idea what the codebase looks like? I cannot remember you on the staff during my 5 years.

The vast majority of code when I was the CTO was very regular-looking Java. We just ran a tight ship, keeping things efficient and not being wasteful. The Disruptor was introduced to solve the problem of how to run things in parallel with very low coordination costs. Those flows were a graph of dependencies for which regular JDK queues proved far too inefficient for what we needed to achieve business-wise. This was all covered in the many public presentations we did. That technical PR was to help us efficiently recruit good people, and the evidence shows it was a success.

With a well-written codebase that uses allocation where appropriate, and not wastefully, I've found that adding Azul Zing into the mix results in a really nice, predictable latency profile. It also comes with an excellent tool chain to monitor and profile your application, facilitating the rapid identification of bottlenecks. This is totally unrelated to the Disruptor.

You say the benchmarks for the Disruptor are flawed. Well they are public. Please point out the mistakes we made and what we should have done better.

You have complained a lot in this thread but seem to offer nothing back of substance. You cite "abstraction" many times but never have you provided an example of good abstraction that illustrates you understand the concept.

As an illustration of abstraction and mechanical sympathy, I'd like to cite the Linux device driver model. We are all spared the gory details of how most devices work because devices are abstracted to be one of 3 types. Those types are character-based, block-based, and network interfaces. If you do not understand these basic abstractions then your code will suffer in more ways than performance. I see mechanical sympathy at the level of understanding these and other basic abstractions. I think this is responsible, and it is irresponsible as a developer to not understand such basic fundamentals. You see recent posts in this group as "technical masturbation". There is a good expression from other disciplines: "A bad workman always blames his tools". You seem to have failed to use tools that many others have been successful with.

So many crimes in development have been created by people who introduce abstractions that are not representative. To be representative requires mechanical sympathy because all technology runs on hardware eventually.

Some of the people in this group study the platform to achieve mechanical sympathy. To embarrass Gil for a second: if he had not studied the Linux virtual memory system to the level he did, then we would not have such a great GC mechanism as afforded by Zing. I hope the likes of Gil continue their vigorous technical masturbation. :-)

Martin...

Robert Engels

unread,
Apr 19, 2014, 11:30:41 AM4/19/14
to mechanica...@googlegroups.com
Isn't this your employee's blog? Do you read it? http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html?m=1


I am only referring to the open sourced disrupter code.

And what you state is a load of nonsense. You provide benchmarks and claim 55 million ops a sec, and it is essentially adding a long. My point was: add in real work or any IO and your techniques provide a marginal benefit over standard solutions. A system without IO doesn't exist...

When you take these things into account you'll realize that horizontal scaling makes any benefits the framework provides for enterprise class systems irrelevant. See Google...

Don't get me wrong, we use many of the same internal techniques as the disrupter, but we're in a very specialized space with very uncommon constraints.

Selling mechanical sympathy to the world is selling snake oil. Or selling guns to children. And it's a bit disgusting.

Gil Tene

unread,
Apr 19, 2014, 11:44:24 AM4/19/14
to <mechanical-sympathy@googlegroups.com>
My HdrHistogram code does zero allocation in the instrumentation and iteration paths. And I'm the guy who came up with a Pauseless collector that actually works. And who advocates for using idiomatic Java and for dispensing with zero-GC practices because they are no longer needed. And there is no contradiction.

Using HdrHistogram works great with idiomatic Java, but its ability to be used even by people who do zero GC in their own code, without breaking their assumptions, makes it much more usable as a tool for third parties. Since HdrHistogram is often used to record latency behavior in super-latency-sensitive systems, being able to fit into someone else's world view was an important design consideration.
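A minimal usage sketch (the value range, percentiles, and the class and method names around the histogram are illustrative choices, not part of the library):

import org.HdrHistogram.Histogram;

// Record latencies on the hot path without allocating, then read percentiles later.
public final class LatencyRecording {
    // Track 1 ns .. 1 hour with 3 significant decimal digits (illustrative bounds).
    private static final Histogram histogram = new Histogram(3_600_000_000_000L, 3);

    static void timeOperation(Runnable operation) {
        long start = System.nanoTime();
        operation.run();
        histogram.recordValue(System.nanoTime() - start); // no allocation on this path
    }

    static void report() {
        System.out.printf("median=%dns 99%%=%dns 99.99%%=%dns max=%dns%n",
                histogram.getValueAtPercentile(50.0),
                histogram.getValueAtPercentile(99.0),
                histogram.getValueAtPercentile(99.99),
                histogram.getMaxValue());
    }
}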

Similarly, the disruptor is a tool other people use in their own systems, and many of its users are very latency-sensitive. The fact that the disruptor itself successfully avoids allocation has no bearing on the coding choices of the rest of the code around it. It does not dictate a no-GC or off-heap way of doing things, and works just as well with or without one. But the fact that it can drop into those environments makes it much more usable by third parties.

Sent from my iPad

Martin Thompson

unread,
Apr 19, 2014, 11:45:17 AM4/19/14
to mechanica...@googlegroups.com
On 19 April 2014 16:30, Robert Engels <ren...@ix.netcom.com> wrote:
 
I have read this and know Trish well. What part of this article illustrates why LMAX moved to Zing because we could not write garbage-free code, as you stated?
 

And what you state is a load of nonsense. You provide benchmarks and claim 55 million ops a sec, and it is essentially adding a long. My point was: add in real work or any IO and your techniques provide a marginal benefit over standard solutions. A system without IO doesn't exist...

The point of the test is to exercise the concurrency model under contention. Before making unfounded statements you should read up on Amdahl's Law and Universal Scalability Law to understand why keeping the contention and coherence cost low is so important. If you studied science you know the importance of performing a clean experiment that is isolated from noise.


When you take these things into account you'll realize that horizontal scaling makes any benefits the framework provides for enterprise class systems irrelevant. See Google...

Horizontal scaling is a valid option when you can do it. Google have probably written more proprietary code than most to achieve what they have. This takes a great deal of study, experimentation, and mechanical sympathy. Should they just have sat around and waited for hardware to catch up?


Don't get me wrong, we use many of the same internal techniques as the disrupter, but we're in a very specialized space with very uncommon constraints.

Selling mechanical sympathy to the world is selling snake oil. Or selling guns to children. And it's a bit disgusting. 

As you have said yourself you have not made these techniques work. Many others have. If you cannot use a tool that many others have then the source of the issue is pretty obvious.

Such an emotive response with no substance again.

Rüdiger Möller

unread,
Apr 19, 2014, 12:22:04 PM4/19/14
to
The only "critical" feedback on the disrupter from my side would be that the documentation starts out in a technical/implementation-oriented way, so many people misunderstand the disruptor as being somehow a magical "ring buffer". As soon as I realized it's basically an "assembly line" pattern to coordinate concurrent work without the need to enqueue stuff multiple times, everything (including the role of the interfaces and classes) started to clear up immediately. Maybe a one-liner in the documentation could fix this :-)

Robert Engels

unread,
Apr 19, 2014, 12:24:33 PM4/19/14
to mechanica...@googlegroups.com


On Saturday, April 19, 2014 10:45:17 AM UTC-5, Martin Thompson wrote:

On 19 April 2014 16:30, Robert Engels <ren...@ix.netcom.com> wrote:
 
I have read this and know Trish well. What part of this article illustrates why LMAX moved to Zing because we could not write garbage-free code, as you stated?


From Michael Barker, in his comments, "As the amount of memory used by the system remains static it reduces the frequency of garbage collection."

If you have a static memory system there is no garbage collection by definition. Seriously, did you read it?

I assume you decided to use Zing because you realized that doing what you were doing led to all sorts of issues for an exchange.. no 24 hr cycles due to reboots to clear garbage, too difficult to adapt to regulatory changes because writing code in this style is not productive, writing everything from scratch to ensure no garbage generation, unlimited order books require dynamic structures or at least pointer references, etc.

 

And what you state is a load of nonsense. You provide benchmarks and claim 55 million ops a sec, and it is essentially adding a long. My point was: add in real work or any IO and your techniques provide a marginal benefit over standard solutions. A system without IO doesn't exist...

The point of the test is to exercise the concurrency model under contention. Before making unfounded statements you should read up on Amdahl's Law and Universal Scalability Law to understand why keeping the contention and coherence cost low is so important. If you studied science you know the importance of performing a clean experiment that is isolated from noise.

Your 55 million ops claim comes from a single writer/reader test passing a single long - and this is a contention test??? I think you might be forgetting all the stuff you/your company has written...

Enough academic papers have been written that prove this micro-benchmark fallacy. Here's one for your reading (ftp://ftp.cs.cmu.edu/project/mach/doc/published/benchmark.ps), because you're obviously confused here.
 


When you take these things into account you'll realize that horizontal scaling makes any benefits the framework provides for enterprise class systems irrelevant. See Google...

Horizontal scaling is a valid option when you can do it. Google have probably written more proprietary code than most to achieve what they have. This takes a great deal of study, experimentation, and mechanical sympathy. Should they just have sat around and waited for hardware to catch up?


Actually, a lot of Google's guts are fairly standard, and the stuff that is general-purpose and doesn't compete with them, they open-source. I have 4 protégés who work there in engineering now. You can also review the Android source (although they didn't write most of that). Not a lot of mechanical sympathy there (outside of the Linux kernel).
 

Don't get me wrong, we use many of the same internal techniques as the disrupter, but we're in a very specialized space with very uncommon constraints.

Selling mechanical sympathy to the world is selling snake oil. Or selling guns to children. And it's a bit disgusting. 

As you have said yourself you have not made these techniques work. Many others have. If you cannot use a tool that many others have then the source of the issue is pretty obvious.

We make these techniques work. Very well, in fact; in many cases much better than similar tests run using the Disrupter. The difference is that even within the company we hide these details from the upper layers so they don't have to deal with them. We don't attempt to promote this level of detail up through the company, and certainly not to the world. We hide our esoteric improvements behind abstractions so the upper-layer people still work with well-understood paradigms.
 

Such an emotive response with no substance again.

Again garbage. Like I said, I'll post the tests next week, and then maybe people will see your benchmarks for what they are... misleading. 

Martin Thompson

unread,
Apr 19, 2014, 12:57:25 PM4/19/14
to mechanica...@googlegroups.com
On 19 April 2014 17:24, Robert Engels <ren...@ix.netcom.com> wrote:


On Saturday, April 19, 2014 10:45:17 AM UTC-5, Martin Thompson wrote:

On 19 April 2014 16:30, Robert Engels <ren...@ix.netcom.com> wrote:
 
I have read this and know Trish well. What part of this article illustrates why LMAX moved to Zing because we could not write garbage-free code, as you stated?


From Michael Barker, in his comments, "As the amount of memory used by the system remains static it reduces the frequency of garbage collection."

If you have a static memory system there is no garbage collection by definition. Seriously, did you read it?

I assume you decided to use Zing because you realized that doing what you were doing led to all sorts of issues for an exchange.. no 24 hr cycles due to reboots to clear garbage, too difficult to adapt to regulatory changes because writing code in this style is not productive, writing everything from scratch to ensure no garbage generation, unlimited order books require dynamic structures or at least pointer references, etc.


I think your "I assume" sums it up pretty well. Note the word in bold.
 
And what you state is a load of nonsense. You provide benchmarks and claim 55 million ops a sec, and it is essentially adding a long. My point was: add in real work or any IO and your techniques provide a marginal benefit over standard solutions. A system without IO doesn't exist...

The point of the test is to exercise the concurrency model under contention. Before making unfounded statements you should read up on Amdahl's Law and Universal Scalability Law to understand why keeping the contention and coherence cost low is so important. If you studied science you know the importance of performing a clean experiment that is isolated from noise.

Your 55 million ops claim comes from a single writer/reader test passing a single long, and this is a contention test??? I think you might be forgetting all the stuff you/your company has written...

Enough academic papers have been written that prove this micro-benchmark fallacy. Here's one for your reading: ftp://ftp.cs.cmu.edu/project/mach/doc/published/benchmark.ps because you're obviously confused here.

I get no data at that link.

Let's be scientific and consider Amdahl's Law or the USL.

The key consideration when scaling up under contention is to identify the serial component of your algorithm and its coherence costs. In these benchmarks we focus on making the serial part as much of the total as possible. If you take your example of adding in some network traffic, then you have diluted the experiment, as it is dominated by network traffic.
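For reference, the two models are usually written as follows (p is the parallelisable fraction of the work, N the number of processors, alpha the contention/serial penalty and beta the coherence penalty; the symbol names are mine, not from the benchmark code):

    Amdahl:  Speedup(N)  = 1 / ((1 - p) + p / N)
    USL:     Capacity(N) = N / (1 + alpha * (N - 1) + beta * N * (N - 1))

Both say the same thing for this discussion: the serial term and the coherence term put a hard ceiling on scaling, so a fair benchmark has to isolate them rather than bury them under unrelated work.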

Standard queues as found in the JDK have a larger serial component within the locks. They also have a higher coherence cost due to involving the operating system for the signalling on condition variables. On top of this they have a very variable latency profile due to involvement of the kernel with associated context switching.

If you have a lot of long-running activity between infrequent interactions with a queue, you have very little contention. However, you will have a very high coherence cost due to the wakeup if you are not busy-spinning.
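To make the contrast concrete, here is a minimal sketch of the two waiting styles (illustrative only, not Disruptor code; the class and method names are made up):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.atomic.AtomicReference;

    final class WaitStyles {
        // Blocking consumer: take() parks on the queue's internal lock/condition,
        // so every hand-off involves the kernel and a context switch.
        static void consumeBlocking(ArrayBlockingQueue<Runnable> queue) throws InterruptedException {
            for (;;) {
                queue.take().run();
            }
        }

        // Busy-spinning consumer: burns a core polling a single slot, but stays in
        // user space, so the wakeup cost is low and the latency profile is stable.
        static void consumeSpinning(AtomicReference<Runnable> slot) {
            for (;;) {
                Runnable task = slot.getAndSet(null);
                if (task != null) {
                    task.run();
                }
                // else keep spinning (Thread.yield() here would be kinder to the scheduler)
            }
        }
    }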

 
When you take these things into account you'll realize that horizontal scaling makes any benefits the framework provides for enterprise class systems irrelevant. See Google...

Horizontal scaling is a valid option when you can do it. Google have probably written more proprietary code than most to achieve what they have. This takes a great deal of study, experimentation, and mechanical sympathy. Should they just have sat around and waited for hardware to catch up?


Actually, a lot of Google's guts are fairly standard, and the general-purpose stuff that doesn't compete with them they open-source. I have 4 protégés who work there in engineering now. You can also review the Android source (although they didn't write most of that). Not a lot of mechanical sympathy there (outside of the Linux kernel).

So much of their work focuses on latency. They moved from MapReduce to Caffeine, as a good example.


Google design their own power supplies to be more efficient. They have modified the Linux kernel to work with faulty memory. They publish many research papers that show innovation at all levels. So much of what they do is about pushing the boundaries of machine sympathy to be more efficient. 

You know 4 people at Google. I'm impressed.
 
 
Don't get me wrong, we use many of the same internal techniques as the Disruptor, but we're in a very specialized space with very uncommon constraints.

Selling mechanical sympathy to the world is selling snake oil. Or selling guns to children. And it's a bit disgusting. 

As you have said yourself you have not made these techniques work. Many others have. If you cannot use a tool that many others have then the source of the issue is pretty obvious.

We make these techniques work. Very well, in fact, in many cases much better than similar tests run using the Disruptor. The difference is that even within the company we hide these details from the upper layers so they don't have to deal with them. We don't attempt to promote this level of detail up through the company, nor certainly to the world. We hide our esoteric improvements behind abstractions so the upper-layer people still work with well-understood paradigms.

You said you failed at this; now you are changing your story. Never mind the flip-flopping: it is very weak to attack work that is in the open, follows scientific principles, and is subject to peer review when you are not willing to do the same yourself.
 
 
Such an emotive response with no substance again.

Again garbage. Like I said, I'll post the tests next week, and then maybe people will see your benchmarks for what they are... misleading.

I'll look forward to it.  

Just out of curiosity, if you think understanding the platform is so pointless and "like selling guns to children", then why are you here? Are you on a mission to teach us the error of our ways? If you have good science I'm willing to listen, but if all you can do is sling mud I don't think many will think very highly of your approach.

Martin...

Robert Engels

unread,
Apr 19, 2014, 4:30:22 PM4/19/14
to mechanica...@googlegroups.com
Uncle.

It's like arguing the existence of god with a priest. You have too much at stake to be wrong, so when evidence is against you, you move the goal posts.

I've seen it for 30+ years... Why do you write in C when assembly is so much faster? Then the same with C++, the holiest of wars with using Java, and now it's even "Java with mechanical sympathy".

If you're smart enough to design the Disruptor you're smart enough to know it doesn't matter....

If you think the several hundred nanosecond improvement matters on a generalized OS, CPU, enterprise system, why don't you just hand code in assembly on specialized hardware with specialized cache controllers, etc. and really go for it - if it matters that much.

Because you are a charlatan selling snake oil.

I started this thread with a very specific problem/observations of low latency low frequency in the context of enterprise class system. Only Gil offered anything constructive in terms of full socket isolation (the continual warming of the code is very difficult).

Because I always do what I say, I will post the test cases next week, but you guys are a bat shit crazy cult and I'm gone.

Kirk Pepperdine

unread,
Apr 20, 2014, 12:50:29 AM4/20/14
to mechanica...@googlegroups.com

If you think the several hundred nanosecond improvement matters on a generalized OS, CPU, enterprise system, why don't you just hand code in assembly on specialized hardware with specialized cache controllers, etc. and really go for it - if it matters that much.

It mattered to Hitler, with those hundreds of nanoseconds he could have won.



Regards,
Kirk

Rüdiger Möller

unread,
Apr 20, 2014, 6:24:58 AM4/20/14
to mechanica...@googlegroups.com
Am Samstag, 19. April 2014 22:30:22 UTC+2 schrieb Robert Engels:
but you guys are a bat shit crazy cult and I'm gone. 

He might have made a point here :-)



Nitsan Wakart

unread,
Apr 20, 2014, 6:42:58 AM4/20/14
to mechanica...@googlegroups.com
+1 it's not proper flaming without Hitler

Kirk Pepperdine

unread,
Apr 20, 2014, 7:29:43 AM4/20/14
to mechanica...@googlegroups.com
now that… is just too funny!


Martin Thompson

unread,
Apr 20, 2014, 8:20:41 AM4/20/14
to mechanica...@googlegroups.com
Brilliant and timeless!

Robert Engels

unread,
Apr 21, 2014, 2:17:17 PM4/21/14
to mechanica...@googlegroups.com
Attached are the modified test cases from the open source Disruptor. They are basically identical to the included tests (except I modified the tests to always include the queue tests), and the iterations are reduced in the doRealWork case.

If you start the standard tests with

-DdoRealWork=true

it will do the 'real' work, which in this case is to send a single datagram packet to the loopback adapter. No garbage is generated outside of the DatagramPacket, which is fairly standard.
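In outline the 'real work' is just this shape (a sketch, not the attached patch; the port number and payload size are arbitrary):

    import java.io.IOException;
    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.net.SocketException;

    final class RealWork {
        private final DatagramSocket socket;
        private final byte[] payload = new byte[64];                       // reused buffer, no per-send allocation
        private final InetAddress loopback = InetAddress.getLoopbackAddress();
        private static final int PORT = 9999;                              // arbitrary loopback port

        RealWork() throws SocketException {
            socket = new DatagramSocket();
        }

        // Called once per event when -DdoRealWork=true is set.
        void perform() throws IOException {
            DatagramPacket packet = new DatagramPacket(payload, payload.length, loopback, PORT);
            socket.send(packet);                                           // the DatagramPacket is the only garbage per call
        }
    }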

The results (multiple runs omitted for space). All tests were run on the same hardware, same OS, same Sun 1.7 JVM, same JVM options, same 3 GB heap.

OnePublisherToOneProcessorUniCastBatchThroughputTest (baseline)

Starting Queue tests
Run 0, BlockingQueue=4,780,800 ops/sec
Run 18, BlockingQueue=4,414,231 ops/sec
Run 19, BlockingQueue=4,411,116 ops/sec
Starting Disruptor tests
Run 0, Disruptor=157,480,314 ops/sec
Run 18, Disruptor=124,533,001 ops/sec
Run 19, Disruptor=124,378,109 ops/sec

Very impressive! Roughly 30 times faster!

OnePublisherToOneProcessorUniCastBatchThroughputTest (real work)

Starting Queue tests
Run 0, BlockingQueue=105,485 ops/sec
Run 18, BlockingQueue=117,924 ops/sec
Run 19, BlockingQueue=117,924 ops/sec
Starting Disruptor tests
Run 0, Disruptor=145,560 ops/sec
Run 18, Disruptor=149,253 ops/sec
Run 19, Disruptor=148,809 ops/sec


Wow, still about 25% faster. Now let's increase the buffer size to something reasonable for 150K ops/sec (LMAX uses buffer sizes of 20 million entries in their main loops) by setting -DlargerBuffers=true

OnePublisherToOneProcessorUniCastBatchThroughputTest (real work with larger buffers)

Starting Queue tests
Run 0, BlockingQueue=131,406 ops/sec
Run 18, BlockingQueue=142,653 ops/sec
Run 19, BlockingQueue=142,450 ops/sec
Starting Disruptor tests
Run 0, Disruptor=144,717 ops/sec
Run 18, Disruptor=148,148 ops/sec
Run 19, Disruptor=148,588 ops/sec

Hmmm, now we are a meager 4% faster... a far cry from the published 10-50x reports...

And this is using really dumb off-the-shelf queues. Much of the difference here is probably the extra allocation being done as items are added/removed from the linked blocking queue. A sane version would use a pre-allocated queue based on a ring buffer (with lock-free synchronization), like the Disruptor.
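Something along those lines might look like this (a sketch of a pre-allocated single-producer/single-consumer ring, not the code used in these tests):

    import java.util.concurrent.atomic.AtomicLong;

    // Lock-free SPSC ring: storage is pre-allocated once, capacity is a power of two.
    final class SpscRing<T> {
        private final Object[] buffer;
        private final int mask;
        private final AtomicLong head = new AtomicLong(); // next slot to read  (consumer only)
        private final AtomicLong tail = new AtomicLong(); // next slot to write (producer only)

        SpscRing(int capacityPowerOfTwo) {
            buffer = new Object[capacityPowerOfTwo];
            mask = capacityPowerOfTwo - 1;
        }

        boolean offer(T value) {                          // producer thread only
            long t = tail.get();
            if (t - head.get() == buffer.length) {
                return false;                             // full
            }
            buffer[(int) (t & mask)] = value;
            tail.lazySet(t + 1);                          // ordered store publishes the element
            return true;
        }

        @SuppressWarnings("unchecked")
        T poll() {                                        // consumer thread only
            long h = head.get();
            if (h == tail.get()) {
                return null;                              // empty
            }
            T value = (T) buffer[(int) (h & mask)];
            buffer[(int) (h & mask)] = null;              // release the reference
            head.lazySet(h + 1);                          // frees the slot for the producer
            return value;
        }
    }

To be fully allocation-free you would also pool the entries themselves, which is exactly what the Disruptor's pre-allocated events give you.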

But wait, let's go back to what started it all, the 55M ops per sec of the non-batch test out of the box.

OnePublisherToOneProcessorUniCastThroughputTest (standard)

Starting Queue tests
Run 0, BlockingQueue=4,407,810 ops/sec
Run 18, BlockingQueue=4,611,482 ops/sec
Run 19, BlockingQueue=4,598,335 ops/sec
Starting Disruptor tests
Run 0, Disruptor=43,917,435 ops/sec
Run 18, Disruptor=46,339,202 ops/sec
Run 19, Disruptor=46,189,376 ops/sec

Yep, 55m ops / sec and almost 10x faster than using queues. But wait, here is my "simple hand-off" doing the same work, with the same queue sizes.

SimpleHandOffTest

run 0, ops per second 45,808,520
run 18, ops per second 73,206,442

run 19, ops per second 73,855,243

Hmmm... 50% faster than The Disruptor...
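The SimpleHandOffTest isn't reproduced above; a hand-off of this sort is typically just a single pre-allocated slot with a sequence counter, roughly along these lines (a sketch, not the attached code):

    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical single-slot hand-off: one producer, one consumer, both busy-spinning.
    // No locks, no allocation; the sequence is even when the slot is empty, odd when full.
    final class SimpleHandOff {
        private final AtomicLong sequence = new AtomicLong();
        private long value;                               // plain field, guarded by the sequence

        void put(long v) {                                // producer thread only
            long s = sequence.get();
            while ((s & 1) != 0) {                        // spin until the slot is empty
                s = sequence.get();
            }
            value = v;
            sequence.lazySet(s + 1);                      // publish: slot is now full
        }

        long take() {                                     // consumer thread only
            long s = sequence.get();
            while ((s & 1) == 0) {                        // spin until the slot is full
                s = sequence.get();
            }
            long v = value;
            sequence.lazySet(s + 1);                      // slot is now empty again
            return v;
        }
    }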

The point is that micro-benchmarks just don't matter except in very specialized, isolated systems; chasing nanoseconds (or writing highly specialized code) instead of focusing on the "holistic design" (as one commentator put it) is usually wasted effort.

Anyway, peace out.

modified_test_cases_to_be_more_realistic.patch

Michael Barker

unread,
Apr 21, 2014, 3:35:26 PM4/21/14
to mechanica...@googlegroups.com
Hi,

I can't get the patches to apply cleanly, they seem to be built against a quite out of date version of the Disruptor.  The project layout changed significantly just prior to the release of 3.0 and these appear to be baselined against the older project layout.  Any chance you could build them against the latest from Github.  https://github.com/LMAX-Exchange/disruptor

Mike.


Rüdiger Möller

unread,
Apr 21, 2014, 3:37:30 PM4/21/14
to mechanica...@googlegroups.com
Robert, 

I think you are missing the point: In many cases there is no "real work".

Example: an in-memory key-value store will have work like "map.put". As most applications today work in memory, business logic has become trivial, and for many applications the scheduling logic/cost is the major issue.

Second, you probably miss another thing: the Disruptor is more than a queue. Regarding your sample:
If done right you'd probably add another EventHandler doing (smart) batching to fill several responses into a datagram. You would never send a reply datagram directly from an incoming processing thread.

Another example of leveraging the extremely low overhead of the Disruptor's inter-core communication:

A simple service uses one or more EventHandlers to decode incoming requests, one handler (= one thread) to perform the (mostly trivial) core business logic, and one (or more) threads to encode the results again for the network. 
This way one can split work across several cores with very low overhead. If you try this using thread pools/JDK queues, you'll fail: because of their inherent overhead, you need a lot of "real work" to make them scale. Practice has many examples where you hit break-even easily using the Disruptor, but not with alternative solutions.

e.g.

disruptor
                .handleEventsWith(decoders)                    // one or more handlers decoding incoming requests
                .then(new EventHandler<TestRequest>() {        // single handler running the core business logic
                    @Override
                    public void onEvent(TestRequest event, long sequence, boolean endOfBatch) throws Exception {
                        event.process(sharedData);
                    }
                })
                .handleEventsWith(encoders);                   // handlers encoding the results for the network
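For completeness, wiring and starting such a pipeline looks roughly like this (a sketch against the Disruptor 3.x DSL; TestRequest here is a placeholder event type, and the handler arrays are supplied by the caller):

    import java.util.concurrent.Executors;

    import com.lmax.disruptor.EventFactory;
    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.dsl.Disruptor;

    final class PipelineWiring {

        static final class TestRequest { Object payload; }        // placeholder event type

        static Disruptor<TestRequest> wireAndStart(EventHandler<TestRequest>[] decoders,
                                                   EventHandler<TestRequest> businessLogic,
                                                   EventHandler<TestRequest>[] encoders) {
            Disruptor<TestRequest> disruptor = new Disruptor<TestRequest>(
                    new EventFactory<TestRequest>() {              // every ring slot pre-allocated once
                        public TestRequest newInstance() { return new TestRequest(); }
                    },
                    1024,                                          // ring size, must be a power of two
                    Executors.newCachedThreadPool());

            disruptor.handleEventsWith(decoders)                   // stage 1: decode on one or more threads
                     .then(businessLogic)                          // stage 2: single-threaded core logic
                     .then(encoders);                              // stage 3: encode results for the network

            disruptor.start();                                     // starts the consumer threads
            return disruptor;
        }
    }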

Robert Engels

unread,
Apr 21, 2014, 3:38:40 PM4/21/14
to mechanica...@googlegroups.com
I'll see if I can easily rebase off 3.0. It was 2.10.4.
-----Original Message-----
From: Michael Barker
Sent: Apr 21, 2014 2:35 PM
To: "mechanica...@googlegroups.com"
Subject: Re: the real latency performance killer

Hi,

I can't get the patches to apply cleanly, they seem to be built against a quite out of date version of the Disruptor.  The project layout changed significantly just prior to the release of 3.0 and these appear to be baselined against the older project layout.  Any chance you could build them against the latest from Github.  https://github.com/LMAX-Exchange/disruptor

Mike.

Robert Engels

unread,
Apr 21, 2014, 3:48:59 PM4/21/14
to mechanica...@googlegroups.com
inline comments...


On Monday, April 21, 2014 2:37:30 PM UTC-5, Rüdiger Möller wrote:
Robert, 

I think you are missing the point: In many cases there is no "real work".

Example: an in-memory key-value store will have work like "map.put". As most applications today work in memory, business logic has become trivial, and for many applications the scheduling logic/cost is the major issue.


IMO, you would never send a trivial operation to another thread in a decent design.

I completely disagree that business logic has become trivial. I would like to see any research stating this.
 

Second, you probably miss another thing: the Disruptor is more than a queue. Regarding your sample:
If done right you'd probably add another EventHandler doing (smart) batching to fill several responses into a datagram. You would never send a reply datagram directly from an incoming processing thread.


Just not true in the HFT space. Every incoming market data event generates an outgoing packet in MANY applications. And outside that, things like software routers and such are 1 for 1.
 
Another example of leveraging the extremely low overhead of the Disruptor's inter-core communication:

A simple service uses one or more EventHandlers to decode incoming requests, one handler (= one thread) to perform the (mostly trivial) core business logic, and one (or more) threads to encode the results again for the network. 
This way one can split work across several cores with very low overhead. If you try this using thread pools/JDK queues, you'll fail: because of their inherent overhead, you need a lot of "real work" to make them scale. Practice has many examples where you hit break-even easily using the Disruptor, but not with alternative solutions.

As I stated already, the queues used here are as dumb as possible. You can still use the 'queue design paradigm' and have them be far more efficient.

And as my last test showed (which I wrote in 30 seconds with no optimization), you can go much faster when you start hard-coding for a specific use case - but only a poor software engineer would take that approach (if that level of performance mattered, you would use dedicated hardware, assembly language, etc.).

Robert Engels

unread,
Apr 21, 2014, 4:12:22 PM4/21/14
to mechanica...@googlegroups.com
The performance tests no longer run a comparison, but:

The OneToOneSequencedThroughputTest runs fine, and the numbers are 49M ops/sec. Almost identical to the 2.10 version.

When I run the OneToOneSequencedThroughputTest with 'real work', I get:

Starting Disruptor tests
Run 0, Disruptor=138,121 ops/sec
Run 1, Disruptor=148,588 ops/sec
Run 2, Disruptor=147,928 ops/sec
Run 3, Disruptor=147,275 ops/sec
Run 4, Disruptor=148,148 ops/sec
Run 5, Disruptor=147,710 ops/sec
Run 6, Disruptor=146,842 ops/sec

again, nearly identical to 2.10.

When I attempt to run the test (unmodified) OneToOneSequencedBatchThroughputTest it just hangs. So, it appears the batching code is broken and has concurrency issues.... hmmm.

I have attached the patch.

-----Original Message-----
From: Robert Engels
Sent: Apr 21, 2014 2:38 PM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer

I'll see if I can easily rebase off 3.0. It was 2.10.4.
-----Original Message-----
From: Michael Barker
Sent: Apr 21, 2014 2:35 PM
To: "mechanica...@googlegroups.com"
Subject: Re: the real latency performance killer

Hi,

I can't get the patches to apply cleanly, they seem to be built against a quite out of date version of the Disruptor.  The project layout changed significantly just prior to the release of 3.0 and these appear to be baselined against the older project layout.  Any chance you could build them against the latest from Github.  https://github.com/LMAX-Exchange/disruptor

Mike.
changes_for_a_realistic_test.patch

Kirk Pepperdine

unread,
Apr 21, 2014, 4:24:36 PM4/21/14
to mechanica...@googlegroups.com
On Apr 21, 2014, at 10:12 PM, Robert Engels <ren...@ix.netcom.com> wrote:

The performance tests no longer run a comparison, but:

The OneToOneSequencedThroughputTest runs fine, and the numbers are 49 m ops/ sec. Almost identical to the 2.10 version.

When I run the OneToOneSequencedThroughputTest with 'real work', I get:

This statement suggests the bench is broken. Are you benchmarking 'real work' or are you benchmarking the framework? If the latter, then 'real work' is going to get in your way.

Regards,
Kirk

Michael Barker

unread,
Apr 21, 2014, 4:30:50 PM4/21/14
to mechanica...@googlegroups.com
When I attempt to run the test (unmodified) OneToOneSequencedBatchThroughputTest it just hangs. So, it appears the batching code is broken and has concurrency issues.... hmmm.

Not hung, just taking a long time. I had been experimenting and left some bad numbers in the test; it was trying to do 200,000,000,000 operations per iteration. Fixed. 