There are only a few cases where Java will beat C++. Now, as you state in your blog, it's easier and faster to write a better-performing Java version of some algorithm, but there are many more optimization opportunities exposed in C++ than in Java. So a skilled C++ developer who is sympathetic to the machine will most likely outpace Java. Not to speak of the amount of memory each server will take. Also, Java is commonly plenty fast, true - the issue most people fight against in low latency is the unpredictability of GC. So both groups end up trying to avoid allocations: the Java guys because of GC, the C++ guys because of malloc/new.
The cache issues you mention, and more generally the discrepancy between core and memory speed, are the primary reasons Java needs facilities to shrink its footprint and stop chasing pointers.
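To make the footprint and pointer-chasing point concrete, here is a minimal Java sketch (the class names and fields are invented for illustration): an array of small objects costs a header and a pointer dereference per element, while flattening the same data into parallel primitive arrays keeps it dense and prefetch-friendly.

// Pointer-chasing layout: Order[] is an array of references; each element is a
// separate heap object with its own header, and iteration hops around the heap.
final class Order {
    long id;
    double price;
    int quantity;
}

// Flattened "structure of arrays" layout: the same fields packed into primitive
// arrays. Iteration is a linear scan the hardware prefetcher can follow.
final class OrderTable {
    final long[] id;
    final double[] price;
    final int[] quantity;

    OrderTable(int capacity) {
        id = new long[capacity];
        price = new double[capacity];
        quantity = new int[capacity];
    }

    double notional(int i) {
        return price[i] * quantity[i];
    }
}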
I don't see how separate processes help to isolate cache usage. What some folks do is flat out divvy up the machine: interrupts are masked out to run on a subset of cores, some processes are then affinitized to run on other cores, etc. Maybe that's what you meant, and it is a headache to maintain these setups.
Not sure which part you disagree with, but I'm sure we all agree that Java could use a diet for data representation. Irrespective of other things, it'd be nice if more stuff fit in cache to begin with, before we start worrying about cache misses and the like.
It's not that people think they're smarter than machines (although that's true as well - people are the ones designing them; machines are just much quicker than humans). The issue is that as you go lower in the stack, things become more and more generalized. As a developer of some system, you typically know a lot more about the data and its flow than anything lower than you. In those cases you want more control because, well, you happen to know more about your use case. The machine can pick up some patterns automatically (e.g. branch prediction, prefetch, etc.), but they're going to be general, "obvious" patterns.
The idea of having the JVM do dynamic layout based on some CPU feedback has been brought up before, but this is a hard problem. What happens if the workload changes? Are you going to re-layout everything? Leave it be? How is this data going to be collected, and for which memory accesses? What is the perf implication, CPU + memory? What happens if the profile collected is not indicative of the optimal layout? This is already an issue with JIT compilation, as its dynamic nature is a blessing and a curse.
It's nice to have default tuning done for you, but for high perf scenarios, there needs to be manual control exposed.
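For readers who want a concrete picture of the kind of manual control being asked for, here is a common hand-rolled sketch (not from this thread; class and field names are invented): padding a contended field out to its own cache line to avoid false sharing, which is exactly the sort of layout decision the JVM does not let you express directly. JDK 8's @sun.misc.Contended (behind -XX:-RestrictContended) is the closer-to-official route.

// Manually padding a contended counter so it does not share a cache line with
// neighbouring data. The p1..p7 / q1..q7 longs are dead weight whose only job
// is to push other fields out of the same 64-byte line.
class PaddedCounter {
    @SuppressWarnings("unused")
    long p1, p2, p3, p4, p5, p6, p7;   // padding before the hot field
    volatile long value;
    @SuppressWarnings("unused")
    long q1, q2, q3, q4, q5, q6, q7;   // padding after the hot field

    void increment() {
        value++;   // illustrative only; a real counter would use Atomic*/CAS
    }
}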
Agreed. I also think that the JVM can provide real gains here with dynamic memory analysis, rather than static compiler optimizations (and manual laying out of shared structures / cache lines, etc., by the developer).
People knock the higher-level abstractions in Java, and continually want lower, more direct access to the hardware, because people think they are smarter than the machine. People will eventually figure out that using higher-level abstractions will lead to better performance when they let go of their egos.
On Thursday, April 17, 2014 1:15:13 PM UTC-5, Martin Thompson wrote:
When it comes to memory access performance, 4 major things matter:
- Volume of data you are shifting, but this is becoming less of an issue with every generation as bandwidth keeps taking huge strides forward.
- Locality: if you are in the same cache line or page then you benefit from warm data caches, TLB caches, and DDR sense amplifier row buffers.
- Predictable access patterns mean the prefetchers can hide the latency by prefetching the data in time for the instructions needing it. Pointer chasing is bad.
- Non-uniform memory access (NUMA) effects. When crossing interconnects between sockets you need to add 20ns for each one-way hop, and depending on your CPU version you may not get prefetch support and may be subject to unexpected writebacks. You need to get used to the likes of numactl and cgroups to ensure your processes run and access memory where you expect.
If your code shows no sympathy to the memory subsystems then you can pay a big performance price.
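To illustrate the locality and prefetching points above with a toy sketch (matrix size arbitrary, no benchmark harness): summing a 2-D array row by row walks memory sequentially and lets the prefetchers work, while summing the same array column by column strides across cache lines and pages and is typically several times slower once the data no longer fits in cache.

// Same arithmetic, very different memory access patterns.
class TraversalOrder {
    static final int N = 2048;                     // ~32 MB, larger than typical L3
    static final double[][] m = new double[N][N];

    // Sequential, prefetch-friendly: inner loop walks one row contiguously.
    static double sumRowMajor() {
        double sum = 0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += m[i][j];
        return sum;
    }

    // Strided, cache-hostile: inner loop touches one element per row,
    // so nearly every access lands on a different cache line.
    static double sumColumnMajor() {
        double sum = 0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += m[i][j];
        return sum;
    }
}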
On Apr 17, 2014, at 8:23 PM, Robert Engels <ren...@ix.netcom.com> wrote:
> It's mainly for Java, but can apply to large C applications as well with certain allocators. By splitting the large process into multiple smaller ones, then isolating them, heap access by some tasks can't affect the cache of the other, smaller "important" tasks.
I don't see how this will help. Threads within a process or in different processes will have the same overall effect on the cores and associated cache. Threads and processes are abstractions, or ways of organizing things for humans. By the time they hit a core....
Regards,
Kirk
-----Original Message-----
From: Martin Thompson
Sent: Apr 17, 2014 2:11 PM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer
This is a pipe dream. If you have studied modern hardware you'd realise the hierarchy will get deeper, and we are moving to core-local and tiled memory.
On 17 April 2014 19:35, Robert Engels <ren...@ix.netcom.com> wrote:
Also, if the hardware engineers could figure out a way to make main memory access as fast as L1 cache, all these problems go away...
There's already plenty of abstraction, through all layers, hardware to software. You basically want something where, given any random app X, the runtime will special-case it in a nearly optimal way, automatically and on-the-fly; and do it in the most performant way; and do it every time. That's not going to happen - it'll get you the 80%, but we need control over the other 20 (or even less; there remains a need for manual control over the small yet important/hot percentage of the codebase).
-----Original Message-----
From: Martin Thompson
Sent: Apr 17, 2014 2:49 PM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer
I think some of the really interesting work on high-level abstractions in this area is on "Cache-Oblivious Algorithms". Here is a nice blog on the potential speedup.
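For anyone unfamiliar with the term, a rough sketch of the idea (my illustration, not taken from the blog referenced): a cache-oblivious algorithm recursively splits the problem until sub-problems fit in whatever cache happens to exist, without ever being told the cache sizes. The classic example is matrix transposition.

// Cache-oblivious matrix transpose: recurse until the tile is small enough
// that it fits in cache at some level of the hierarchy, whatever that size is.
class CacheObliviousTranspose {
    static final int CUTOFF = 32; // small base case; the exact value is not critical

    static void transpose(double[][] src, double[][] dst,
                          int r0, int r1, int c0, int c1) {
        if (r1 - r0 <= CUTOFF && c1 - c0 <= CUTOFF) {
            for (int r = r0; r < r1; r++)
                for (int c = c0; c < c1; c++)
                    dst[c][r] = src[r][c];
        } else if (r1 - r0 >= c1 - c0) {
            int rm = (r0 + r1) / 2;          // split the larger dimension
            transpose(src, dst, r0, rm, c0, c1);
            transpose(src, dst, rm, r1, c0, c1);
        } else {
            int cm = (c0 + c1) / 2;
            transpose(src, dst, r0, r1, c0, cm);
            transpose(src, dst, r0, r1, cm, c1);
        }
    }
}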
On 17 April 2014 20:42, Robert Engels <ren...@ix.netcom.com> wrote:
I agree that that is the likely direction (my original comment was intended as a joke), but that even makes more of a case for higher-level abstractions to take advantage of huge (> 1024 core) machines with larger local caches. With higher abstractions it becomes much easier to break processes apart and transparently integrate when needed (sometimes with no developer effort, everything is RMI, etc.), and let the OS/JVM figure out what to run where. Trying to do this manually with massively parallel machines is very difficult.
You are very correct... my bad. We use that exact setup to avoid the GC threads thrashing the cache, but the GC still moves objects around, which destroys the cache anyway... Keeping needed code and cache hot (in an idle-loop sense) is not always possible... (or often not easy to do without very ugly code).
To clarify on your first point though, if the other cores all have working sets within their L2 cache size (or less restrictive, their total working sets within L3 - sizeof(L2)), aren't you essentially protecting the L2 and L1 used by the isolated core?
But to bring us back where I started, this is why hardware support for non-cached read/writes with the ability to control which threads/classes use these calls might be helpful.
Your points also seem to suggest that specialized/embedded systems would probably do better with non-inclusive L3 caches - protecting the per core L2 based on the configured loads. Do you know any architectures using such a setup?
Follow up answers inline.
On Thursday, April 17, 2014 6:45:36 PM UTC-7, Robert Engels wrote:
You are very correct... my bad. We use that exact setup to avoid the GC threads thrashing the cache, but the GC still moves objects around, which destroys the cache anyway... Keeping needed code and cache hot (in an idle-loop sense) is not always possible... (or often not easy to do without very ugly code).
Yes. It's ugly. But not keeping it warm guarantees it won't last long in the presence of other activity in the same socket. Luckily, both spatial and temporal locality are alive and well in most applications, and hardware prefetchers are really good at dealing with multi-line access patterns, so missing this stuff back into the L1 is not that big a deal (sub-usec hits to get back to being warm, usually). But that's just as true when GC kicked your objects out or moved them around...
To clarify on your first point though, if the other cores all have working sets within their L2 cache size (or less restrictive, their total working sets within L3 - sizeof(L2)), aren't you essentially protecting the L2 and L1 used by the isolated core?
Nothing practical (that interacts with the outside world) lives purely within its L2 cache size. At the very least, any network I/O you are doing will be moving in and out of the L3 cache. After ~20MB of network traffic, or any other memory traffic, all your idle (not actively being hit) L2 and L1 contents will have been thrown away, and will generate new L3 cache misses. So if your isolated core is mostly idle or spinning (which is usually the case), and it does not actively access the contents of its L2, any other activity in the socket will cause that cold L2 to get thrown away.
But to bring us back where I started, this is why hardware support for non-cached read/writes with the ability to control which threads/classes use these calls might be helpful.
Non-cached reads/writes are usually used for specialized I/O operations... They are so expensive (compared to cached ones) that I highly doubt this will be used for anything real (like "everything this thread does is non-cached"). Remember that non-cached also means non-streaming and non-prefetchable. It also means that each word or byte access is a separate ~200 cycle memory access. Also remember that stack memory is generally indistinguishable from local memory, and that the CPU has a limited set of registers... That all adds up to "threads/classes that are forced to use non-cacheable memory for everything are not useful for much of anything".
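As a rough sanity check on that cost, using the ~200 cycle figure above (my arithmetic, for illustration only): reading a 1 KB message from uncacheable memory means 128 separate 8-byte loads, roughly 128 x 200 = ~25,000 cycles, i.e. on the order of 8 microseconds at ~3 GHz. The same 1 KB read through the cache is at most 16 cache-line fills, and usually far fewer cycles once the prefetchers spot the pattern.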
Your points also seem to suggest that specialized/embedded systems would probably do better with non-inclusive L3 caches - protecting the per core L2 based on the configured loads. Do you know any architectures using such a setup?
There is a good reason for the lowest level cache (aka "LLC"; the one closest to memory) being inclusive in pretty much all multi-socket architectures. When the LLC is inclusive of all closer-to-the-cores caches, coherency traffic is only needed between LLCs, and only for coherency state transitions at the LLC level. The hit and miss rates (in ops/sec, not in %) in LLCs are orders of magnitude smaller than they are in L1 (and L2 where it exists). If the LLC was not inclusive, state changes in the inner caches would need to be communicated to all other caches, and the cross-socket coherency traffic volume would grow by a couple of orders of magnitude, which simply isn't practical with chip-to-chip interconnects and pin counts.
Did you read my blog post that started this? A real world example of why this is just not the case... And I'm certain this is more common than people think.
Another great example is Linux itself. Look at the performance gains that have been made over the years. But it's only been possible with massive numbers of developers and massive numbers of bugs.
Almost all of the big gains are from algorithmic changes, which are often hard to get into the tree... just to try and ensure correctness, limit possible cross-module effects, and so that the other developers can understand the scope of the changes.
Then you have frameworks like "the Disruptor", and when it does real work it is only 1% faster than the standard methods. And now they use Azul Zing anyway, because writing GC-less Java isn't Java, and so you might as well write in C.
On the non-cached reads/writes, I think I am being a bit misunderstood. What I am proposing is to basically be able to treat a core as a non-SMP core. Obviously the locking and memory fencing needs to be more sophisticated, but I think you can see where I am headed.
Also, one quick question in regards to your statement on network I/O. Wouldn't the driver/card do the I/O directly to memory, bypassing the cache? And then wouldn't the driver move the memory directly to the mapped buffer space before reading, thereby reusing the same address/cache lines and never affecting the rest of the cache? It would seem that just remapping the buffer and destroying the cache in the process would be too expensive overall? Otherwise it would seem that any high-volume network application is never using the L3 cache anyway...
-----Original Message-----
From: Martin Thompson
Sent: Apr 18, 2014 9:23 AM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer
So is your point that the efforts of this group are futile? Is your quest to prove us all wrong and that we should not care for what is under all the abstractions? Without a really compelling case this sort of approach will come across to others as trolling, whatever the underlying motivation.
The reason mechanical sympathy started in motor racing was because many drivers had reached the point of blindly accepting abstractions, and this resulted in not only reduced performance but also greatly increased risk of harm.
We are all a product of our own experience. In my experience showing mechanical sympathy can result not just in enormous performance gains - 3 to 10 fold improvements in response time and throughput are not uncommon - but also much more robust applications.
Every memory system in common use moves data between the levels in its hierarchy in blocks, not bits or bytes; it has done it this way for a long time and will for a long time more. They offer their services by taking bets on temporal, spatial, and pattern-based usage. That is all they can offer. Are you arguing that people should not understand these fundamental abstractions? To my mind that is mechanical sympathy. We have abstractions, we don't need to know intimate detail, but we need the appropriate level of detail. Without that appropriate level of detail, not only does performance suffer, code is a lot less robust. I've seen so many bugs in networking and storage code due to a lack of understanding of the basic abstractions.
Abstractions are at their best when small, composable, and fractal. I cringe when people talk about abstractions that are these huge monoliths that do not compose or have fractal characteristics - yes, big frameworks, I'm looking at you! :-)
On April 18, 2014 8:08:32 AM CDT, Rüdiger Möller <moru...@gmail.com> wrote:
On Thursday, 17 April 2014 20:27:48 UTC+2, Robert Engels wrote:
> Agreed. I also think that the JVM can provide real gains here with dynamic memory analysis, rather than static compiler optimizations (and manual laying out of shared structures / cache lines, etc., by the developer). People knock the higher-level abstractions in Java, and continually want lower, more direct access to the hardware, because people think they are smarter than the machine. People will eventually figure out that using higher-level abstractions will lead to better performance when they let go of their egos.
Depends on the market you are in. In a competitive environment, your "higher level abstractions" Java app will always be behind, like 20% (at best), compared to a manually optimized "to-the-metal" application. Actually, hardware innovation cycles are not that high. Frequently only some percent of your overall codebase is tweaked to perform on current hardware, so you might overestimate the cost of mechanical sympathy.
As a 30+ year engineer, I know marketing crap. Take the LMAX 'Disruptor'. It's crap. In real-world tests it's only marginally faster, with much greater complexity and constraints.
-----Original Message-----
From: Rüdiger Möller
Sent: Apr 18, 2014 10:08 AM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer
On Friday, 18 April 2014 15:55:57 UTC+2, Robert Engels wrote:
Did you read my blog post that started this? A real world example of why this is just not the case... And I'm certain this is more common than people think.
I read it; however, beating some unknown C++ library with an unknown Java implementation tells me what?
Another great example is Linux itself. Look at the performances gains that have been made over the years. But its only been possible with massive numbers of developers and massive numbers of bugs.
An OS without mechanical sympathy would hardly have been successful, would it?
All most all of the big gains are from algorithmic changes which are often hard to get into the tree... Just to try and ensure correctness, and limit possible cross module affects, and so that the other developers can understand the scope of the changes.
I regularly speed up programs, written by people who believe in these popular hoaxes, by factors of 2 to 10. In many cases choosing the "best algorithm" is trivial, but still one implementation is 5 times faster than another. On the business side: performance still matters, as cloud cost scales pretty much linearly with app performance. Operational cost can be an issue if you need to operate a cluster of 5 servers to solve a problem which could be done on a single machine.
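To illustrate that with a toy example (invented here, not one of the programs referred to): both methods below implement the same trivial O(n) sum, but the boxed-collection version pointer-chases a wrapper object per element, while the primitive-array version is a dense linear scan; on large inputs the difference is commonly several-fold, even though "the algorithm" is identical.

import java.util.List;

class SameAlgorithmDifferentSpeed {
    // Boxed: each element is a separate Integer object scattered on the heap.
    static long sumBoxed(List<Integer> values) {
        long sum = 0;
        for (Integer v : values) sum += v;   // unboxing + pointer chase per element
        return sum;
    }

    // Primitive: one contiguous block of ints, friendly to caches and prefetchers.
    static long sumPrimitive(int[] values) {
        long sum = 0;
        for (int v : values) sum += v;
        return sum;
    }
}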
Then you have frameworks like "the disrupter" and when it does real work it is only 1% faster than the standard methods. And now they use Azul Zing anyway because writing gc-less java isn't java and so you my might as well write in C.
There is a difference between being "GC-less" and wasting memory like there is no tomorrow. Most developers use abstractions and frameworks without even knowing the cost. However, an architectural decision always needs to compare benefit and cost. You are the best example above ("1%"): when used in the correct place, the speedup of using pipelining compared to naive queuing/pool executors is massive.
No offence, but I sometimes cannot understand why there are so many "performance myths" which are plain out wrong; so many frameworks and libraries are hyped but in reality they beam you back like 10 years performance-wise (without any need to do so).
Of course performance might not be the most important thing for many apps; however, one should be able to quantify the performance cost if a decision is made to use a specific design pattern, framework or abstraction.
I am slightly confused on
It doesn't matter how the network traffic gets to/from memory from/to the NICs. If any core in your socket interacts with the data in the network traffic at any point, that data will be moving in and out of the L3, evicting other cold data as it goes. That's why, with an LRU cache, cold data only survives on idle sockets (ones that don't bring anything into their L3).
Imagine the case where you only had a single buffer of 1k that ALL network traffic went through (if you didn't process the packet fast enough it was dropped). How would this destroy the cache? Isn't the buffer going to be at a fixed location in main memory and retain the same physical address mapped in the cache, so at most reading/writing to this buffer could only destroy 1k of the cache???
And your cloud comment is nonsense. It's a prime reason for the success of 'Ruby on Rails'. You know how they optimize? Add another server... (because it's dog shit slow), and it is currently the most popular cloud-based infrastructure component.
I think you'll find that for all but the largest of users, the infrastructure costs of adding another server (or servers) to double performance are far cheaper than hiring developers to do the same...
I've given this some more thought, and given that the network buffer size is certainly greater than the L3 cache, wouldn't it be better for the NIC to write directly to isolated main memory (bypassing the cache) when queuing the incoming packet, and then have the kernel perform a memcpy to a constant "processing buffer" in the cache, maybe one per open socket, only affecting the cache then? Otherwise it would seem that on even decently fast networks (even 1Gb), the network traffic alone makes the L3 cache useless.
It's just a thought... It just seems strange that you can conceivably send a packet (rare, low frequency) across a high-speed network to an idle machine (with intact cache), which then in turn acts on it, and get far better performance than trying to do the work on the machine receiving the actual data. Seems ripe for some sort of better partitioning scheme (although, as you stated, you would probably get the best performance by just processing the request on an isolated socket).
It was clearly an exaggeration, but I have experienced multiple times in my career, when dealing with large-scale enterprise applications that take years to develop and deploy (with business requirements changing all the time), that Moore's law holds, and that by the time you are releasing, what was a performance bottleneck often is no more - and by focusing on the flexibility of the design first, you aren't stuck with an outdated application by the time it is released.
I have experienced many times a business constrained by the performance of its systems, unable to add hardware quickly enough to release new product, sell more, add features expected by the market, or, worse, deal with new regulatory requirements.
You know that depends on the use case... If I have an operation that takes 1 sec to complete and it's done 3 times a day, and it's got really clean, well-designed code, and is easy to maintain, am I going to hack it to make it work 20% faster??? I would hope not.
Take that same app, and it has a job that takes 10 hours to complete, and it runs overnight (or in the background). Am I going to hack it to pieces to make it run in 8 hrs? Unless it is a COMPLETELY O(N^2) shit design, you aren't going to make the 10 hr job complete in 10 seconds, no matter what coding changes you make...
The worst is to design for the 10 seconds at the start and realize that no matter what you do it's going to take 8 hrs; now you have the complex code to maintain, and good luck scaling out, etc.
That is why I make reference to the 'Disruptor', as it's really easy to show that it doesn't matter... The test cases are well written, and they include comparison tests with "more standard" implementations. Just make the worker do something extremely simple, like send a multicast UDP packet. This operation is orders of magnitude slower than the message/queue processing overhead, so these "performance improvements" quickly go out the window...
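For concreteness, a sketch of the kind of "real work" worker being described (the multicast group, port and buffer size are invented; error handling omitted): each event results in one multicast UDP send, which costs microseconds and dwarfs tens-of-nanoseconds queuing overhead.

import java.net.InetSocketAddress;
import java.net.StandardProtocolFamily;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;

// Hypothetical handler whose per-event work is a single multicast UDP send.
class MulticastWorker {
    private final DatagramChannel channel;
    private final InetSocketAddress group =
            new InetSocketAddress("239.1.2.3", 9000);      // made-up group/port
    private final ByteBuffer buffer = ByteBuffer.allocateDirect(64);

    MulticastWorker() throws java.io.IOException {
        channel = DatagramChannel.open(StandardProtocolFamily.INET);
    }

    void onEvent(long value) throws java.io.IOException {
        buffer.clear();
        buffer.putLong(value).flip();
        channel.send(buffer, group);   // microseconds of I/O per event
    }
}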
I will post the more realistic test cases next week when I'm in the office.
I will also point to the fact that LMAX needed to move to Azul Zing anyway, because the approach of no-garbage Java just doesn't work for all but the most trivial systems.
Isn't this your employee's blog? Do you read it? http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html?m=1
And what you state is a load of nonsense. You provide benchmarks and claim 55 million ops a sec, and it is essentially adding a long. My point was: add in real work or any I/O and your techniques provide a marginal benefit over standard solutions. A system without I/O doesn't exist...
When you take these things into account you'll realize that horizontal scaling makes any benefits the framework provides for enterprise class systems irrelevant. See Google...
Don't get me wrong, we use many of the same internal techniques as the Disruptor, but we're in a very specialized space with very uncommon constraints.
Selling mechanical sympathy to the world is selling snake oil. Or selling guns to children. And it's a bit disgusting.
On 19 April 2014 16:30, Robert Engels <ren...@ix.netcom.com> wrote:
Isn't this your employee's blog? Do you read it? http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html?m=1
I have read this and know Trish well. What part of this article illustrates why LMAX moved to Zing because we could not write garbage-free code, as you stated?
And what you state is a load of nonsense. You provide benchmarks and claim 55 million oops a sec, and it is essentially adding a long. My point was add in real work or any io and your techniques provide a marginal benefit over standard solutions. A system without io doesn't exist...
The point of the test is to exercise the concurrency model under contention. Before making unfounded statements you should read up on Amdahl's Law and Universal Scalability Law to understand why keeping the contention and coherence cost low is so important. If you studied science you know the importance of performing a clean experiment that is isolated from noise.
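For readers who want the reference points (standard formulas, not taken from this thread): Amdahl's Law gives speedup(N) = 1 / ((1 - p) + p/N) for a fraction p of parallelizable work over N cores, and the Universal Scalability Law extends it to C(N) = N / (1 + a(N - 1) + b*N(N - 1)), where a models contention and b models coherence cost; the b*N(N - 1) term is why keeping coherence traffic low matters so much as core counts grow.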
When you take these things into account you'll realize that horizontal scaling makes any benefits the framework provides for enterprise class systems irrelevant. See Google...
Horizontal scaling is a valid option when you can do it. Google have probably written more proprietary code than most to achieve what they have. This takes a great deal of study, experimentation, and mechanical sympathy. Should they just have sat around and waited for hardware to catch up?
Don't get me wrong, we use many of the same internal techniques as the disrupter, but we're in a very specialized space with very uncommon constraints.
Selling mechanical sympathy to the world is selling snake oil. Or selling guns to children. And it's a bit disgusting.
As you have said yourself, you have not made these techniques work. Many others have. If you cannot use a tool that many others have, then the source of the issue is pretty obvious.
Such an emotive response with no substance again.
On Saturday, April 19, 2014 10:45:17 AM UTC-5, Martin Thompson wrote:
On 19 April 2014 16:30, Robert Engels <ren...@ix.netcom.com> wrote:
Isn't this your employee's blog? Do you read it? http://mechanitis.blogspot.com/2011/06/dissecting-disruptor-whats-so-special.html?m=1
I have read this and know Trish well. What part of this article illustrates why LMAX moved to Zing because we could not write garbage-free code, as you stated?
From Michael Barker, in his comments: "As the amount of memory used by the system remains static it reduces the frequency of garbage collection." If you have a static memory system there is no garbage collection by definition. Seriously, did you read it?
I assume you decided to use Zing because you realized that doing what you were doing led to all sorts of issues for an exchange: no 24 hr cycles due to reboots to clear garbage, too difficult to adapt to regulatory changes because writing code in this style is not productive, writing everything from scratch to ensure no garbage generation, unlimited order books requiring dynamic structures or at least pointer references, etc.
And what you state is a load of nonsense. You provide benchmarks and claim 55 million oops a sec, and it is essentially adding a long. My point was add in real work or any io and your techniques provide a marginal benefit over standard solutions. A system without io doesn't exist...
The point of the test is to exercise the concurrency model under contention. Before making unfounded statements you should read up on Amdahl's Law and Universal Scalability Law to understand why keeping the contention and coherence cost low is so important. If you studied science you know the importance of performing a clean experiment that is isolated from noise.
Your 55 million ops claim comes from a single writer/reader test passing a single long - and this is a contention test??? I think you might be forgetting all the stuff you/your company has written... Enough academic papers have been written that prove this micro-benchmark fallacy. Here's one for your reading, ftp://ftp.cs.cmu.edu/project/mach/doc/published/benchmark.ps, because you're obviously confused here.
When you take these things into account you'll realize that horizontal scaling makes any benefits the framework provides for enterprise class systems irrelevant. See Google...
Horizontal scaling is a valid option when you can do it. Google have probably written more proprietary code than most to achieve what they have. This takes a great deal of study, experimentation, and mechanical sympathy. Should they just have sat around and waited for hardware to catch up?
Actually, a lot of Google's guts are fairly standard, and the stuff that is general purpose and doesn't compete with them, they open-source. I have 4 proteges that work there in engineering now. You can also review the Android source (although they didn't write most of that). Not a lot of mechanical sympathy there (outside of the Linux kernel).
Don't get me wrong, we use many of the same internal techniques as the disrupter, but we're in a very specialized space with very uncommon constraints.
Selling mechanical sympathy to the world is selling snake oil. Or selling guns to children. And it's a bit disgusting.
As you have said yourself, you have not made these techniques work. Many others have. If you cannot use a tool that many others have, then the source of the issue is pretty obvious.
We make these techniques work. Very well; in fact, in many cases much better than similar tests run using the Disruptor. The difference is that even within the company we hide these details from the upper layers so they don't have to deal with it. We don't attempt to promote this level of detail up through the company, nor certainly to the world. We hide our esoteric improvements behind abstractions so the upper-layer people still work with well-understood paradigms.
Such an emotive response with no substance again.
Again, garbage. Like I said, I'll post the tests next week, and then maybe people will see your benchmarks for what they are... misleading.
If you think the several hundred nanosecond improvement matters on a generalized OS, CPU, enterprise system, why don't you just hand code in assembly on specialized hardware with specialized cache controllers, etc. and really go for it - if it matters that much.
but you guys are a bat shit crazy cult and I'm gone.
I was referred to this group by a colleague, and the participants are certainly more knowledgeable than myself, but I'd like to throw out my two cents anyway.
As a recent blog post of mine showed, Java easily outperforms C++ in real-world tests, but even this test is flawed... The problem is that even though this is a real-world test, doing "real" work, it is still essentially a micro-benchmark. Why?
Which brings me to the heart of the problem... Here are the memory access times on a typical modern processor (Core i7 / Xeon 5500 series, data source latency, approximate):
- L1 cache hit: ~4 cycles
- L2 cache hit: ~10 cycles
- L3 cache hit, line unshared: ~40 cycles
- L3 cache hit, shared line in another core: ~65 cycles
- L3 cache hit, modified in another core: ~75 cycles
- Remote L3 cache: ~100-300 cycles
- Local DRAM: ~60 ns
- Remote DRAM: ~100 ns
So with my "real-world" test, the heart of the code path is always in level 1 cache, with the predictive loading of the cache when the message object is retrieved.
Now, compare this with a true real-world application, with gigabytes of heap. Most modern processors have about 20 MB of shared level 3 cache, which is a fraction of the memory in use, so when the garbage collector is moving things around, and/or background "house-keeping" tasks are doing their work, they are blowing out the CPU caches (even the non-shared L2 cache is destroyed by a compacting garbage collector). Even isolated CPUs don't help with the latter.
So when your low-frequency but low-latency code runs (say, sending an order in response to some market event), it is going to run 5x (or more if NUMA is involved) slower than the micro-benchmark case, due to the non-cached main memory access.
How do we fix this? Two ways.
With CPU support for "non-cached reads and writes", a thread (or possibly a class/object) can be marked as "background", and then memory accesses by this thread/class do not go through the cache, hopefully preserving the L2/L3 cache for the "important" threads.
Similarly, an object/class marked "important" is a clue to the garbage collector not to move this object around if at all possible. This can sort of be solved now with off-heap memory structures, but they're a pain (at least in the current incarnation).
Without something similar to the above, I just don't think low-frequency and low-latency is possible.
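For reference, a minimal sketch of the off-heap approach mentioned above (the record layout, field offsets and names are invented for illustration): a direct ByteBuffer gives you memory the collector never moves or scans, at the cost of hand-rolled field access.

import java.nio.ByteBuffer;

// A GC-immune "order" record living in a direct (off-heap) buffer. The collector
// neither moves nor scans this memory; the price is manual offset bookkeeping.
class OffHeapOrder {
    private static final int ID_OFFSET = 0;      // long
    private static final int PRICE_OFFSET = 8;   // double
    private static final int QTY_OFFSET = 16;    // int
    private static final int SIZE = 20;

    private final ByteBuffer buf = ByteBuffer.allocateDirect(SIZE);

    void set(long id, double price, int qty) {
        buf.putLong(ID_OFFSET, id);
        buf.putDouble(PRICE_OFFSET, price);
        buf.putInt(QTY_OFFSET, qty);
    }

    double notional() {
        return buf.getDouble(PRICE_OFFSET) * buf.getInt(QTY_OFFSET);
    }
}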
disruptor.handleEventsWith(decoders).then(new EventHandler<TestRequest>() {
    @Override
    public void onEvent(TestRequest event, long sequence, boolean endOfBatch) throws Exception {
        event.process(sharedData);
    }
}).handleEventsWith(encoders)
-----Original Message-----
From: Michael Barker
Sent: Apr 21, 2014 2:35 PM
To: "mechanica...@googlegroups.com"
Subject: Re: the real latency performance killer
Hi,
I can't get the patches to apply cleanly; they seem to be built against a quite out-of-date version of the Disruptor. The project layout changed significantly just prior to the release of 3.0, and these appear to be baselined against the older project layout. Any chance you could build them against the latest from GitHub? https://github.com/LMAX-Exchange/disruptor
Mike.
On 22 April 2014 06:17, Robert Engels <ren...@ix.netcom.com> wrote:
Robert,
I think you are missing the point: in many cases there is no "real work". Example: an in-memory key-value store will have work like "map.put". As most applications today work in-memory, business logic has become trivial, and for many applications scheduling logic/cost is the major issue.
Second, you probably miss another thing: the Disruptor is more than a queue. Regarding your sample: if done right you'd probably add another EventHandler doing (smart) batching to fill several responses into a datagram. You won't ever send a reply datagram directly from an incoming processing thread.
Another example of leveraging the extremely low overhead of the Disruptor's inter-core communication: a simple service using one or more EventHandlers to decode incoming requests, one handler (= thread) to perform the (mostly trivial) core business logic, and one (or more) threads to encode the results again for the network. This way one can split work across several cores with very low overhead. If you try this using thread pools/JDK queues, you'll fail: because of their inherent overhead, you need a lot of "real work" to make them scale. Practice has many examples where you hit the break-even using the Disruptor easily, but not with alternative solutions.
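A hedged sketch of the pipeline shape being described, against the Disruptor 3.x DSL as I understand it (the event type, handler bodies, buffer size and wait strategy are invented for illustration):

import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import java.util.concurrent.Executors;

class PipelineSketch {
    // Mutable event reused for every slot of the ring buffer (no per-message allocation).
    static final class Request {
        byte[] rawBytes;
        Object decoded;
        Object response;
    }

    public static void main(String[] args) {
        EventHandler<Request> decoder = (event, sequence, endOfBatch) -> {
            // parse event.rawBytes into event.decoded
        };
        EventHandler<Request> logic = (event, sequence, endOfBatch) -> {
            // mostly trivial business logic producing event.response
        };
        EventHandler<Request> encoder = (event, sequence, endOfBatch) -> {
            // serialize event.response; a smart-batching handler would flush on endOfBatch
        };

        Disruptor<Request> disruptor = new Disruptor<>(
                Request::new, 1024, Executors.defaultThreadFactory(),
                ProducerType.SINGLE, new BusySpinWaitStrategy());

        // decode -> business logic -> encode, each stage on its own core.
        disruptor.handleEventsWith(decoder).then(logic).then(encoder);
        disruptor.start();
    }
}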
-----Original Message-----
From: Robert Engels
Sent: Apr 21, 2014 2:38 PM
To: mechanica...@googlegroups.com
Subject: Re: the real latency performance killer
I'll see if I can easily rebase off 3.0. It was 2.10.4.
The performance tests no longer run a comparison, but: the OneToOneSequencedThroughputTest runs fine, and the numbers are 49 M ops/sec, almost identical to the 2.10 version. When I run the OneToOneSequencedThroughputTest with 'real work', I get:
When I attempt to run the (unmodified) OneToOneSequencedBatchThroughputTest, it just hangs. So it appears the batching code is broken and has concurrency issues... hmmm.