--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
[Some of this is cut/pasted from emails in which I had answered similar questions]

A few clarification points before answering the questions themselves: You mentioned that throughput is not the main concern (<1K messages/sec expected), and that you want to minimize the latency between receiving an event and sending a response. But when you say "minimize the latency", do you mean:

a) you want to minimize the best case latency
b) you want to minimize the median latency
c) you want to minimize the worst experienced latency
d) some other level (e.g. minimize the 99.9%'ile)

The answers to some of your questions will vary depending on your answers. And the right answer is probably to have goals for at least three of the points above (b, c, and d at least).

If all you care about is the best case and the median latency, you are probably best off with a single socket, highest clock rate, small core count setup with HT turned off. So what if your latency jumps to 10-20msec every once in a while? You'll have great best case and median numbers to show off, and you said you don't care about high percentiles.

However, if you care at all about your higher percentiles, you'll want to make sure to have enough cores to accommodate the peak number of runnable threads in your system, with some room to spare, and even then you may want to play some taskset, numactl, or isolcpus games. This is where higher core counts, dual socket setups, and turning HT on come into play: they all help keep the high percentiles and outliers down (in both frequency and magnitude).
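To make the difference between those goals concrete, here is a minimal, hypothetical sketch of reporting median vs. tail vs. worst-case from a batch of recorded latencies. A real measurement setup would use a histogram library (e.g. HdrHistogram) instead of storing and sorting raw samples, but the idea is the same:

```java
import java.util.Arrays;

// Nearest-rank percentile reporting over recorded latencies (illustrative only).
public class LatencyPercentiles {
    // Returns the pct-th percentile of an already-sorted array of samples.
    static long percentile(long[] sorted, double pct) {
        int idx = (int) Math.ceil(pct / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
    }

    public static void main(String[] args) {
        // Hypothetical latencies in usec: steady ~12-15usec with one 900usec outlier.
        long[] lat = {12, 15, 11, 14, 13, 12, 900, 13, 12, 14};
        long[] sorted = lat.clone();
        Arrays.sort(sorted);
        System.out.println("median = " + percentile(sorted, 50.0)); // 13
        System.out.println("p99    = " + percentile(sorted, 99.0)); // 900
        System.out.println("worst  = " + sorted[sorted.length - 1]); // 900
    }
}
```

The point of the sketch: a system can show a great median (13usec here) while its 99th percentile and worst case are two orders of magnitude worse, which is exactly why goals (b), (c), and (d) need to be stated separately.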
As for NICs: my experience is that burning a core or two on OpenOnload or the like (spinning with no backoff) is worth a couple of usec in the common case, and that this is especially true for non-saturated workloads (like your <1K/sec), where each latency is measured from a standstill (responding to a single event). If what you care about is latency (and not power, space, or saving a few $), burning those cores is a cheap way to get what you want. Same goes for the disruptor. Burning cores on controlled & dedicated spinning buys you better latency across the board, as long as you have plenty of cores to burn (it gets really bad when you don't have those cores).

When it comes to comparing clock frequencies, I usually tell people to compare the max turbo boost frequencies, rather than the nominal ones. E.g. an E5-2697 V3 has 14 cores, 35MB of L3, and shows 2.6GHz nominal. An E5-2687W V3 has 10 cores, 25MB of L3, and shows 3.1GHz nominal. And an E5-2667 V3 has 8 cores, 20MB of L3, and shows 3.2GHz nominal. So you may think the 3.2GHz number is best. But if you look carefully, all three chips have the same Turbo Boost frequency (3.6GHz), and chances are that with the same number of cores actually busy, they'll all be running at the same frequency. But the larger chip (E5-2697) has more cores, which can help with outliers and allow for more generous "burning" of cores. And it has a significantly bigger L3, which can help with everything.
When you are latency sensitive, your best hint as an initial filter is power envelope: you should be looking at and choosing between chips in the higher power envelope rankings (135-160W usually). The lower power chips will keep your frequencies lower. [Also note that desktop chips tend to be classified by their turbo boost, as in "up to 4.4GHz", while server chips focus on nominal frequency (2.6GHz rather than "up to 3.6GHz"); that's probably marketing driven.]
For the common single socket vs. dual socket question, and Haswell i7 vs. E5, I usually lean towards the E5s. Yes, something like the i7-4790K has a higher clock rate: 4.0GHz nominal, boosting to 4.4GHz, but it only has 4 physical cores, and with the disruptor and a good NIC stack, you'll be burning through those cores fast, and may have 10-20msec outliers to deal with, causing you to prematurely give up on spinning and start paying in common case latency for this lack of spinning cores. Another E5 benefit is memory capacity and bandwidth. i7s tend to peak at 32GB, and have only 2 memory channels, while E5s have 4 channels, support faster DRAMs (e.g. 2133 vs. 1600), and a lot more memory (a typical config of 128GB per socket is commonplace in 2 socket servers today; e.g. that's what the latest EC2 machines seem to be built with).
As for 1 vs. 2 sockets: most 1 vs. 2 socket situations where the 1 socket system "wins" can be matched with numactl and locking the workload (including its RAM) down to one of the two sockets. 2 socket setups have the benefit of allowing you to separate background stuff (like the OS and rarely active processes or threads, and potentially your own non-critical threads) from your critical workload. They do require more care, e.g. with things like IRQ assignment and making sure your packets are processed (in the network stack) in the same L3 that your consuming threads will be on, but in the end that extra capacity can be a lifesaver.

Bottom line: when people ask me about a "good" system for low latency (and this happens a lot), I answer that if $$ or power are not the drivers, and latency is, my current recommendation is to go with a 2 socket E5 (with something like an E5-2697 V3), and 128GB-256GB of memory. The street price for that is somewhere between $8K and $12K ($3K-$4K of which is spent when focusing on the faster, higher core count Xeons). I also recommend investing in a good 10G NIC (like Solarflare), which will add some to that price, but buy you a few more usec.
On Thursday, October 23, 2014 1:35:42 PM UTC-7, Michael Mattoss wrote:
+1
What's isocups? Saw your reference to it but google didn't come up with much.
On 25 October 2014 16:40, Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.
Perhaps I'm missing something here, but if you warm up the cache before the system switches to the steady-state phase, and you use thread affinity and isolcpus, why would the cache get cold, even when the message rate is low? Does the CPU evict cache lines based on time?
1. What are pros & cons of disabling HT (other than the obvious reduction of logical cores)?
2. Does it make sense to enable HT to increase the total number of available cores, but to isolate some of the physical cores and assign them only 1 thread, so that the thread does not share the physical core with any other threads?
3. Is there a rule of thumb on how to balance cache size against core speed?
> 3. Is there a rule of thumb on how to balance cache size against core speed?

My rule of thumb is that cache is always better than core speed if you have non-I/O related cache misses. This obviously reverses at some point (at an L3 miss rate of 0.00001% I may go with core speed). The harder tradeoff is cache size vs. L3 latency. Because the L3 sits in a ring in modern Xeons, the larger the L3 is, the more hops there are in the ring, and the higher the L3 latency can get to some (random) addresses. It gets more complicated, too. E.g. on the newer Haswell chips, going above a certain number of cores may step you into a longer-latency huge L3 ring, or may force you to partition the chip in two (with 2 shorter latency rings with limited bandwidth and higher latency when crossing rings).

But honestly, detail levels like L3 latency variance between chip sizes are so *way* down in the noise at the point where most people start studying their latency behavior. You may get to actually caring about it eventually, but you probably have 20 bigger fish to fry first.
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!
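The lambda warning above can be made concrete. As a sketch (this reflects current HotSpot behavior via LambdaMetafactory; the language spec does not guarantee it either way): a lambda that captures local state is typically a fresh object on every evaluation, while a non-capturing lambda is usually a cached singleton per call site.

```java
import java.util.function.Supplier;

// Which lambdas allocate? Behavior shown is HotSpot's current
// LambdaMetafactory strategy, not a spec guarantee.
public class LambdaAlloc {
    static Runnable nonCapturing() {
        return () -> System.nanoTime();   // captures nothing
    }

    static Supplier<Long> capturing(long x) {
        return () -> x + 1;               // captures x: a new object per evaluation
    }

    public static void main(String[] args) {
        // On current HotSpot the non-capturing lambda is a cached instance,
        // so both calls return the same object.
        System.out.println(nonCapturing() == nonCapturing());
        // The capturing lambda is freshly allocated each time: a hidden
        // allocation on what may look like an allocation-free hot path.
        System.out.println(capturing(1) == capturing(1)); // false
    }
}
```

The trap is that both call sites look identical in source; only the capture makes one of them allocate on the hot path.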
Strings are also a particularly painful topic as far as the JVM is concerned: often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?
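For what it's worth, "writing your own string handling" often means routines like the hypothetical one below: parsing a value straight out of an ASCII byte buffer, so that no String, substring, or char[] copy is ever created on the hot path.

```java
// Hypothetical example of a hand-rolled, allocation-free ASCII routine:
// parse a decimal int directly from a byte array slice.
public class AsciiParse {
    static int parseInt(byte[] buf, int offset, int len) {
        boolean negative = buf[offset] == '-';
        int i = negative ? offset + 1 : offset;
        int value = 0;
        for (int end = offset + len; i < end; i++) {
            byte b = buf[i];
            if (b < '0' || b > '9') throw new NumberFormatException("at index " + i);
            value = value * 10 + (b - '0');
        }
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        byte[] msg = "PRICE=-1234;".getBytes(); // e.g. bytes straight off a socket
        // Parse the field in place: no String, no substring, no garbage.
        System.out.println(parseInt(msg, 6, 5)); // -1234
    }
}
```

Libraries that do this at scale keep data in byte[]/ByteBuffer form end to end and only materialize Strings at the edges, if at all.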
Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.
> 1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!

Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations (at least on the JVM -- can't speak to native code, but I can imagine that malloc/free could hurt just as much as GC there). Considered pooling briefly in the past, but our working set is often large enough that it basically becomes a choice between potentially exhausting pools or cluttering old gen. Even without pooling, careful reuse of objects between iterations can be fraught with risk, as any other allocations on the hot path can quietly shuttle any number of objects over to old gen. It's tough.
+1. It doesn't help that the mainstream java community continues to endorse and promote the "allocations are cheap" myth.
On 28 October 2014 04:52, Tom Lee <m...@tomlee.co> wrote:

> Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations [...]

The summary is that if you are working in the 10s of microseconds space then any allocation will result in significant outliers, regardless of JVM. Anyone in HFT is usually well aware of this issue. However, the picture gets more interesting for general purpose applications. I get to profile quite a few applications in the field across a number of domains. The thing this has taught me is that the biggest performance improvements often come from simply reducing the allocation rate. A little bit of allocation profiling and some minor code changes can give big returns. I recommend to all my customers that they run a profiler regularly and keep allocation to modest levels, regardless of application type, and the returns are significant for minimal effort. Our current garbage collectors seem to be not up to the job of coping with the new multicore world and large memory servers - Zing excluded :-)

Allocation is just as big an issue in the C/C++ world. In all worlds it is the reclamation, rather than the allocation, that is the issue.
Just try allocating memory on one thread and freeing it on another in a native language and see how that performs. The big benefit we get in the native world is stack allocation. This has so many benefits besides cheap allocation and reclamation: it is also local in the hot OS page and does not suffer false sharing issues. In the multicore world I so miss stack allocation when using Java. It does not have to be a language feature; Escape Analysis could be better, or an alternative like the object explosion JRockit had could help.
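To illustrate what escape analysis can and cannot do for you: in the sketch below, the temporary `Vec` never leaves `distance()`, so HotSpot's optimizing compiler can scalar-replace it (no heap allocation) once the method is hot. But this is an opportunistic optimization, not a guarantee, which is exactly why true stack allocation is still missed.

```java
// Escape analysis sketch: the short-lived Vec is a candidate for scalar
// replacement because it never escapes the method. Whether the allocation
// is actually eliminated depends on the JIT, not on the language.
public class EscapeDemo {
    static final class Vec {
        final double x, y;
        Vec(double x, double y) { this.x = x; this.y = y; }
    }

    static double distance(double x1, double y1, double x2, double y2) {
        Vec d = new Vec(x2 - x1, y2 - y1); // may never touch the heap when hot
        return Math.sqrt(d.x * d.x + d.y * d.y);
    }

    public static void main(String[] args) {
        System.out.println(distance(0, 0, 3, 4)); // 5.0
    }
}
```

The fragility is the point: a small refactor that lets `d` escape (storing it in a field, passing it to a non-inlined method) silently reintroduces a per-call allocation.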
Just take the simple example of Selector. You do a selectNow() and then you get a selected key set that you must iterate over. Why not just pass a callback to selectNow(), or pass in a collection to fill? This and the likes of String.split() are examples of brain dead API design that causes performance issues. Richard and I have worked on the likes of this, and he has at least listened to my whines and is trying to do something about it.
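The String.split() complaint is easy to demonstrate: split() allocates an array plus a fresh String per token even when you only want one field. A hand-rolled indexOf() scan (a hypothetical sketch of the usual workaround) allocates only the one substring you actually return:

```java
// Comparing the allocation profiles of two ways to pull one field
// out of a comma-separated line.
public class SplitVsScan {
    // Allocation-heavy: a String[] plus one String per field, all but one discarded.
    static String fieldBySplit(String line, int n) {
        return line.split(",")[n];
    }

    // Lighter: walk the line with indexOf and allocate only the returned substring.
    static String fieldByScan(String line, int n) {
        int start = 0;
        for (int i = 0; i < n; i++) {
            start = line.indexOf(',', start) + 1;
            if (start == 0) throw new IllegalArgumentException("not enough fields");
        }
        int end = line.indexOf(',', start);
        return end < 0 ? line.substring(start) : line.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "2014-10-23,AAPL,104.83,BUY"; // hypothetical market event
        System.out.println(fieldBySplit(line, 2)); // 104.83
        System.out.println(fieldByScan(line, 2));  // 104.83
    }
}
```

On a hot path processing thousands of such lines, the difference between the two is a steady stream of young-gen garbage vs. almost none.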
The question always arises in my head: "Why are you using Strings?" If it's because you want a human-readable data format, then using a binary encoding which has a program that lets you pretty-print the encoding is just as good IMO, and avoids a lot of these issues. If you're interacting with an external protocol which is text based, I can appreciate that this isn't a decision you can make so easily.
<< Putting on my contrarian "but allocations ARE cheap" hat. >>

I disagree, guys. This notion that "allocation causes outliers" is bogus. Pauses and glitches are not an inherent side-effect of allocation. They are side effects of bad allocator implementations.
On Thursday, October 23, 2014 1:35:42 PM UTC-7, Michael Mattoss wrote:

Hi all,
I wrote an application based around the Disruptor that receives market events and responses accordingly.
I'm trying to decide which hardware configuration I should purchase for the server hosting the app, that would minimize the latency between receiving an event and sending a response.
I should mention that the app receives < 1K messages/second so throughput is not much of an issue.
I tried to do some research but the amount of data/conflicting info is overwhelming so I was hoping some of the experts on this group could offer their insights.
How should I choose the right CPU type? Should I go with a Xeon E5/E7 for the large cache or should I favor a high speed CPU like the i7 4790K (4.4Ghz) since 99% of work is done in a single thread?
What about the new Haswell-E CPU's which seem to strike a good balance between cache size & core speed and also utilize DDR4 memory?
Does it matter if the memory configuration of a 16GB RAM for example is 4x4GB or 2x8GB?
Should I use an SSD or a high performance (15K RPM) mechanical drive? (The app runs entirely in memory of course, and the BL thread is not I/O bound, but there's a substantial amount of data written sequentially to log files.) How about a combination of the two (SSD for the OS and a mechanical drive for log files)?
Is it worth investing in a high performance NIC such as those offered by Solarflare if OpenOnload (kernel bypass) is not used (just for the benefit of CPU offloading)?
Any help, suggestions and tips you may offer would be greatly appreciated.
Thank you!
Michael
This is consistent with my experience as well, both in java and .NET (not necessarily the same allocation rate Kirk observes, but the general notion).
There's a lot of work involved in stopping mutators, doing a GC, and restarting them (safepoints, stack walking, OS kernel calls, icache and dcache pollution, actual GC logic, etc). Even if collections are done concurrently, there are resources being taken away from app threads. Sure, if you have a giant machine where the GC can be fully segregated from app threads, maybe that's OK. But even then, I'd rather use more of that machine for app logic rather than overhead, so keeping allocations to a minimum is beneficial.
In native languages, people tend to care about memory allocation more (some of that is natural to the fact that, well, it's an unmanaged environment, but it's also due to being conscious of its performance implications), whereas in java and .NET it's a free-for-all; I blame that on the "allocations are cheap" mantra. Yes, an individual allocation is cheap, but not when you start pounding the system with them. Native apps tend to use arenas/pools for long-lived objects (especially ones with specific lifetimes, e.g. servicing a request), and stack memory for temporaries. Unfortunately, as we all know, there's no sure way to use stack memory (unless you want to destructure objects into method arguments, which is ugly and brittle at best).
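The arena/pool pattern mentioned above translates directly to Java. Below is a minimal, hypothetical free-list pool sketch: single-threaded for clarity, whereas a real pool would need a concurrent structure (or one pool per thread) plus a policy for exhaustion and for resetting object state on reuse.

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// Minimal free-list object pool: keeps request-scoped objects out of the GC
// by recycling them. Not thread-safe; illustrative only.
public class ObjectPool<T> {
    private final ArrayDeque<T> free = new ArrayDeque<>();
    private final Supplier<T> factory;
    private int created;

    public ObjectPool(Supplier<T> factory) { this.factory = factory; }

    public T acquire() {
        T t = free.pollFirst();
        if (t == null) { created++; t = factory.get(); } // allocate only on a miss
        return t;
    }

    public void release(T t) { free.addFirst(t); }

    public int created() { return created; }

    public static void main(String[] args) {
        ObjectPool<StringBuilder> pool = new ObjectPool<>(StringBuilder::new);
        StringBuilder sb = pool.acquire();   // first acquire allocates
        sb.append("request-1");
        sb.setLength(0);                     // caller must reset state before release
        pool.release(sb);
        StringBuilder sb2 = pool.acquire();  // reused: no new allocation
        System.out.println(sb == sb2);       // true
        System.out.println(pool.created()); // 1
    }
}
```

This is also where the tradeoff from earlier in the thread bites: size the pool for the working set or you either exhaust it under load or end up with pooled objects cluttering old gen anyway.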
Quite a few of the big well known "big data"/perf sensitive java projects have hit GC problems, leading to solutions similar to what people do in native land. I don't have experience with Azul's VM so maybe it really would solve most of these problems for majority cases. But personally, I'm greedy - I want more of my code running than the VM's! :)
There is an additional sentiment that I wish I understood. Seems that many people believe that parsing is easier with ASCII.
The internet running in debug mode is one of the most brilliant statements on the state of logging today that I’ve ever heard.. I want to quote you!!!
This rush to off-heap memory IMHO isn't justified. I've recently been looking at Neo4J, Cassandra, Hazelcast and a whole host of other like technologies. Instead of solving the real problems they have in their implementations, they've thrown caution to the wind and gone off heap. The real problem they all have is that they are written in a very memory-inefficient way. I bet that if they were to improve the memory efficiency, the motivation to go off-heap would be weak at best. Let's face it: going off-heap is *cool* and has nothing to do with real requirements.
Hmmm. I get to see many very real requirements to go off the Java heap. Inter Process Communication being just one.

The products you list are all data stores. How would you operate a 100GB+ heap in Java without using Zing and cope with the GC pauses or enable memory prefetching and page locality? The only way I can see to do this is to stuff the data into primitive arrays, but then that is just as awkward as going off heap.
I've had allocation rates all over the spectrum, anywhere from 200MB to 1GB per second, and this is ranging from java server daemons to .NET gui apps. The rate that bogs down the app is, of course, dependent on machine class and application workload.
By the way, I'm not a proponent of "zero allocation at runtime at all" - I don't mind allocating where it makes sense, but that's almost always on slow/exceptional paths. Hot/fast paths, however, should not allocate. I'm not a masochist that likes to tune GC knobs - my "tuning" involves being a cheapskate with memory :).
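A toy illustration of the "allocate only on the slow path" idea (the class and cache size here are invented for the example, loosely modeled on the JDK's own small-integer cache): the hot path hands back preallocated objects, and only the rare out-of-range input pays for an allocation.

```java
// Sketch: hot path returns preallocated boxed values; only rare,
// out-of-range inputs fall through to the allocating slow path.
final class SmallInts {
    private static final Integer[] CACHE = new Integer[256];
    static {
        for (int i = 0; i < 256; i++) CACHE[i] = i; // boxed once, up front
    }

    static Integer box(int v) {
        if (v >= 0 && v < 256) return CACHE[v]; // hot path: no allocation
        return Integer.valueOf(v);              // slow path: may allocate
    }
}
```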
All of this depends on the performance requirements of the system, obviously. If pausing for several tens/hundreds of millis and above every few seconds is ok, then sure, don't contort the code. If, however, we're talking about consistent performance in the sub-millisecond range, then every little waste adds up and you get death by a thousand cuts. I'm sure we've all seen profiles (cpu + mem) where there's no big elephant in the room but rather a bunch of small leeches.
By the way, off-heap isn't the only solution I was referring to when I mentioned other projects hitting GC problems. Stack Overflow, for example, simply started using structs more (this is .NET). The Roslyn team (the C# compiler written in C#) started reusing/pooling certain objects and avoiding subtle boxing in hot paths, etc. Cassandra went off-heap, yes, but they determined that their object lifetimes just didn't jibe well with how collectors want things to be in order to stay efficient.
My point isn't to be draconian here and not allocate at all (we're not talking about a microcontroller for a space rover here), but for people to stop using the "allocation is cheap" mantra in an effort to dismiss being wasteful. I'm convinced that this statement has caused more harm than good. People build libraries and allocate needlessly here and there, thinking "no big deal, these are short-lived, die young, don't get copied, don't contribute to young GC time, and I only do a handful of these, blah blah blah". Then you put tens of these libraries together to form some app, and bam - each of those "isolated mindset designs", which wouldn't be terrible if it were the only thing running, now brings the system to its knees or leaves perf on the table for no good reason.
Gil's mechanical sympathy math is a nice analysis, but it seems to paper over the cache implications a bit. Sure, cores can sustain memory throughput of GB/s, but it's not like we're not using those channels for other things - I want to put as much of those resources towards my app logic, and not towards runtime infra/overhead. Object reuse/pooling doesn't have to be slower than allocation; I'm not sure why that's being made to sound like a fact. If we're talking about threadsafe pools that service multiple threads, sure, but that's not the only type of pool/use case; you can reuse objects for a single thread of execution.
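A quick sketch of that single-threaded reuse case: no thread-safe pool involved, just one mutable holder recycled by its owning thread (all names here are illustrative, not from any of the projects discussed).

```java
// Sketch of single-threaded object reuse: a mutable result holder is
// filled in-place on each call instead of allocating a fresh result
// object per element/iteration. No synchronization needed, because
// the holder is owned by one thread.
final class Stats {
    static final class MinMax {
        long min, max;   // mutable, reused across calls
    }

    // Fills 'out' rather than returning a newly allocated object.
    static void minMax(long[] xs, MinMax out) {
        out.min = Long.MAX_VALUE;
        out.max = Long.MIN_VALUE;
        for (long x : xs) {
            if (x < out.min) out.min = x;
            if (x > out.max) out.max = x;
        }
    }
}
```

Calling `minMax` in a tight loop with one `MinMax` instance allocates nothing per iteration, which is the whole point being argued above.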
Let's also not forget that if your allocation rate is dependent on load, then you may tune things for today's load and reach satisfactory latency and/or throughput, but if tomorrow the load changes, you're back to same exercise. If, however, the memory footprint is constant, then you don't need to worry about that (you may need to tweak other things in the system for higher load, but that's beside the point and always the case anyway).
I also wouldn't discount the effects of initiating a GC on your application performance. Every time it's triggered, execution resources are shifted to servicing it. As I said before, the data and instruction caches will be polluted; branch target buffers will be polluted; u-op or trace caches polluted; you get calls into the kernel involving the scheduler; etc, etc. Sure, we're talking about tens/hundreds of millis here, but in those tens/hundreds of millis I could've had tens/hundreds of millions of my own instructions executed, per core! Let me reiterate: if you care about consistent sub-millisecond performance, all these things matter and add up. If your performance requirements are more lax, then maybe they don't. Luckily, as Gil pointed out, modern machines are flat-out beasts and can dampen these types of things.
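If you want to see how much time the collector is actually charging your process, the standard JVM management beans expose it; a minimal sketch (the class name is invented, the API is standard `java.lang.management`):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sketch: total accumulated wall time the JVM reports spending in GC,
// summed across all registered collectors. One cheap way to check
// whether the pauses discussed above are actually biting your app.
final class GcTime {
    static long totalGcMillis() {
        long ms = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();  // -1 when unsupported
            if (t > 0) ms += t;
        }
        return ms;
    }
}
```

Note this only captures what the beans report (pause accounting varies by collector); it says nothing about the indirect cache/BTB pollution costs described above.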
On Oct 31, 2014 7:48 AM, "Benedict Elliott Smith" <bellio...@datastax.com> wrote:
>
> Kirk,
>
> A quick response to your statement about Cassandra going off-heap, since I've done some of the recent work on that. There are two reasons for doing this, and neither are because it's cool.
>
> 1) Cassandra is intended for realtime scenarios where lengthy GC pauses are problematic - not everyone can run Zing, and a reduction in allocation without any change on any unrelated workload characteristics necessarily yields fewer GCs;
> 2) A majority of Cassandra's memory consumption is for many easily grouped, easily managed, relatively long-lived allocations. In this scenario you are wasting a persistent GC CPU burden walking object graphs you know aren't going to be collected. On top of which you're allocating/copying them multiple times, through each stage of the object lifecycle. This further disrupts the behaviour over the short lived objects, causing them to be promoted more often, resulting in a disproportionate proliferation of those lengthy full-GCs.
>
> That's not to say there aren't many other improvements to be made besides, nor that we haven't made many improvements for the on-heap memory characteristics, but I suspect you are responding only to the marketing material, not to the actual development work going on. I encourage you to get involved if you feel like you can make a contribution.
+1 -- same goes for HBase. In fact Intel is driving a lot of the work on HBase because without it we can't leverage high ram machines.
Todd
>
> On Fri, Oct 31, 2014 at 8:07 AM, Kirk Pepperdine <ki...@kodewerk.com> wrote:
>>
>>
>> On Oct 30, 2014, at 2:54 PM, Vitaly Davidovich <vit...@gmail.com> wrote:
>>
>>> This is consistent with my experience as well, both in java and .NET (not necessarily the same allocation rate Kirk observes, but the general notion).
>>>
>>>
>> I’m curious to know what is the threshold on rate that make you want to look at allocations?
>>>
>>> There's a lot of work involved in stopping mutators, doing a GC, and restarting them (safepoint, stack walking, OS kernel calls, icache and dcache pollution, actual GC logic, etc). Even if collections are done concurrently, there're resources being taken away from app threads. Sure, if you have a giant machine where the GC can be fully segregated from app threads, maybe that's ok. But even then, I'd rather use more of that machine for app logic rather than overhead, so keeping allocations to a minimal is beneficial.
>>>
>>>
>> Humm, I see huge benefits of lowering allocation rates even without GC being a problem. IOWs GC throughput can even be 99% and if the allocation rates are 1G (for example), working to push them down always brings a significant gain.
>>
>>> In native languages, people tend to care about memory allocation more (some of that is natural to the fact that, well, it's an unmanaged environment, but it's also due to being conscious of its performance implications), whereas in java and .NET it's a free-for-all; I blame that on the "allocations are cheap" mantra. Yes, an individual allocation is cheap, but not when you start pounding the system with them. Native apps tend to use arenas/pools for long-lived objects (especially ones with specific lifetimes, e.g. servicing a request), and stack memory for temporaries. Unfortunately, as we all know, there's no sure way to use stack memory (unless you want to destructure objects into method arguments, which is ugly and brittle at best).
>>>
>>> Quite a few of the big well known "big data"/perf sensitive java projects have hit GC problems, leading to solutions similar to what people do in native land. I don't have experience with Azul's VM so maybe it really would solve most of these problems for majority cases. But personally, I'm greedy - I want more of my code running than the VM's! :)
>>
>>
>> This rush to off-heap memory IMHO isn’t justified. I’ve recently been looking at Neo4J, Cassandra, Hazelcast and a whole host of other like technologies. Instead of solving the real problems that have in the implementation they’ve thrown caution to the wind and then gone off heap. The real problem they all have is that they are written is a very memory inefficient way. I bet that if they were to improve the memory efficiency that the motivation to go off-heap would be weak at best. Lets face it.. going off-heap is *cool* and has nothing to do with real requirements.
>>
>> Regards,
>> Kirk
>>
>
Hi Gil,

> These "keep allocation rates down to 640KB/sec" (oh, right, you said 300MB/sec) guidelines are purely driven by GC pausing behavior. Nothing else.

No, even when GC is working wonderfully I can still make gains in application performance by reducing allocation rates to approximately the levels mentioned. If the limit should be 20+GB/sec as you suggest, then I'm just not seeing it in the field, as I get disproportionate gains in performance by simply improving allocation efficiency. This is not a case of you are wrong and I am right. And I don't fully understand why; I only have ideas... I'd like to dig deeper, but at the moment all I can tell you is this is what I'm seeing.
My take on this is that simple GC benchmarks are just like simple CPU benchmarks: they can give you a false impression of what the costs would be when baked into a more complex application. For example, a simple benchmark may be able to hit in the cache for its entirety whereas a real application may miss, and an algorithm that's perhaps less CPU-optimized but touches less memory may outperform a CPU-optimized one that touches more memory.
A higher thread count, all else being equal, means more stress on the OS scheduler (and threads may end up migrating around, further inducing stalls), more frequent young GCs with all their costs outlined several times on this thread, more context switches, and TLABs that are possibly smaller and get retired quicker, etc. In "real" applications, allocations matter! But really, this is no different from other guidelines for performant code: the cheapest code is code that's not called at all.
Lumpy.local-58% java -Xmx4g -Xms4g -Xmn2g -verbosegc -XX:+PrintGCTimeStamps -cp . AllocationRateExample -c 1 -t 15000
1.396: [GC (Allocation Failure) 1572864K->632K(3932160K), 0.0011270 secs]
2.083: [GC (Allocation Failure) 1573496K->680K(3932160K), 0.0009779 secs]
2.614: [GC (Allocation Failure) 1573544K->664K(3932160K), 0.0009521 secs]
3.275: [GC (Allocation Failure) 1573528K->664K(3932160K), 0.0007557 secs]
3.883: [GC (Allocation Failure) 1573528K->600K(3932160K), 0.0008720 secs]
4.483: [GC (Allocation Failure) 1573464K->664K(4193280K), 0.0012363 secs]
5.628: [GC (Allocation Failure) 2095768K->708K(4193280K), 0.0009738 secs]
6.423: [GC (Allocation Failure) 2095812K->644K(4193280K), 0.0004019 secs]
7.207: [GC (Allocation Failure) 2095748K->548K(4193280K), 0.0004225 secs]
7.933: [GC (Allocation Failure) 2095652K->580K(4193280K), 0.0006007 secs]
8.670: [GC (Allocation Failure) 2095684K->644K(4193280K), 0.0006047 secs]
9.506: [GC (Allocation Failure) 2095748K->644K(4193280K), 0.0005459 secs]
10.323: [GC (Allocation Failure) 2095748K->580K(4193280K), 0.0005144 secs]
11.130: [GC (Allocation Failure) 2095684K->644K(4193280K), 0.0004345 secs]
11.949: [GC (Allocation Failure) 2095748K->644K(4193280K), 0.0004276 secs]
12.752: [GC (Allocation Failure) 2095748K->548K(4193280K), 0.0004619 secs]
13.569: [GC (Allocation Failure) 2095652K->580K(4193280K), 0.0005234 secs]
14.386: [GC (Allocation Failure) 2095684K->612K(4193280K), 0.0008006 secs]
15.234: [GC (Allocation Failure) 2095716K->580K(4193280K), 0.0005469 secs]
16.031: [GC (Allocation Failure) 2095684K->612K(4193280K), 0.0004742 secs]
Done....
rhine-77% java -Xmx20g -Xms20g -Xmn10g -verbosegc -XX:+PrintGCTimeStamps -cp . AllocationRateExample -c 16 -t 15000
0.421: [GC 7864320K->680K(19660800K), 0.0042570 secs]
0.674: [GC 7865000K->808K(19660800K), 0.0034670 secs]
1.012: [GC 7865128K->840K(19660800K), 0.0035250 secs]
1.326: [GC 7865160K->760K(19660800K), 0.0021870 secs]
1.641: [GC 7865080K->872K(19660800K), 0.0019550 secs]
1.994: [GC 7865192K->680K(20970560K), 0.0026520 secs]
2.418: [GC 10484520K->560K(20969856K), 0.0017370 secs]
2.869: [GC 10484400K->528K(20970368K), 0.0013940 secs]
3.334: [GC 10484112K->496K(20969536K), 0.0017700 secs]
3.794: [GC 10484080K->656K(20970368K), 0.0013320 secs]
4.265: [GC 10484112K->496K(20970368K), 0.0016030 secs]
4.737: [GC 10483952K->528K(20970368K), 0.0016240 secs]
5.206: [GC 10483984K->496K(20970368K), 0.0023400 secs]
5.687: [GC 10483952K->464K(20970432K), 0.0021000 secs]
6.168: [GC 10483984K->592K(20970368K), 0.0016850 secs]
6.644: [GC 10484112K->528K(20970496K), 0.0013200 secs]
7.111: [GC 10484176K->656K(20970432K), 0.0016640 secs]
7.587: [GC 10484304K->464K(20970560K), 0.0015380 secs]
8.064: [GC 10484304K->464K(20970560K), 0.0025490 secs]
8.535: [GC 10484304K->496K(20970624K), 0.0022670 secs]
9.009: [GC 10484400K->656K(20970560K), 0.0026790 secs]
9.488: [GC 10484560K->560K(20970688K), 0.0014580 secs]
9.968: [GC 10484592K->624K(20970624K), 0.0008420 secs]
10.447: [GC 10484656K->624K(20970752K), 0.0015560 secs]
10.929: [GC 10484848K->560K(20970752K), 0.0014460 secs]
11.405: [GC 10484784K->720K(20970816K), 0.0013580 secs]
11.883: [GC 10485072K->560K(20970816K), 0.0015900 secs]
12.355: [GC 10484912K->528K(20970816K), 0.0008100 secs]
12.829: [GC 10484880K->528K(20970816K), 0.0011210 secs]
13.307: [GC 10484880K->496K(20970880K), 0.0025160 secs]
13.784: [GC 10484976K->496K(20970880K), 0.0023670 secs]
14.267: [GC 10484976K->592K(20970944K), 0.0009670 secs]
14.748: [GC 10485136K->592K(20970880K), 0.0016410 secs]
15.224: [GC 10485136K->592K(20970944K), 0.0033210 secs]
15.704: [GC 10485200K->560K(20970944K), 0.0015320 secs]
Done....
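The source of `AllocationRateExample` isn't included in the thread. Purely as a hypothetical reconstruction, a driver like the following (`-c` threads allocating small arrays for `-t` milliseconds) would produce the kind of steady stream of young-gen "Allocation Failure" collections shown in the logs above:

```java
// Hypothetical reconstruction (NOT the actual AllocationRateExample):
// 'threads' workers allocate ~1KB arrays in a tight loop until the
// deadline, driving a sustained allocation rate into the young gen.
final class AllocLoop {
    static volatile Object sink;  // defeat escape analysis / dead-code elim

    static void run(int threads, long millis) throws InterruptedException {
        final long deadline = System.currentTimeMillis() + millis;
        Thread[] ts = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            ts[i] = new Thread(() -> {
                while (System.currentTimeMillis() < deadline) {
                    sink = new byte[1024];   // ~1KB per iteration
                }
            });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println("Done....");
    }
}
```

Run with `-verbose:gc` and a fixed young gen (as in the command lines above) to watch the collection frequency track the number of allocating threads.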
Gil,
Am I right that you would expect:

4. [on HotSpot] Higher promotion rates, thus higher Full GC frequency, thus a decrease in "speed"?
>I'd recommend using 50-80GB heaps with 20+GB newgens
Not everybody can afford that. My notebook is just 16GiB (see example below).
>"is allocation (and not something else) responsible for slowdown in latency/throughput, and for inquired outliers."
Ultimately it would turn out that the DNA of the code author is responsible for the outliers, while "pure allocation" is fine.
Here's example of "tuning by allocation":
https://github.com/checkstyle/checkstyle/pull/136
Average allocation rate before tuning was 12GB/40sec == 0.3GB/sec.
It's minuscule.
It turns out that just a bit of tuning (e.g. removing non-required
.clone() calls like [2]) cuts the duration of "mvn verify" time in
half: 40 sec -> 20 sec.
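Illustrative only (not the actual checkstyle code at [2]): the kind of non-required defensive `clone()` such a fix removes looks roughly like this, where every getter call copies a whole array even though callers never mutate it:

```java
// Sketch of the before/after shape of removing a non-required clone():
// breaksCopying() allocates and copies the array on every call;
// breaksSharing() returns the existing reference, allocating nothing.
// Sharing is only safe when callers are trusted not to mutate the array.
final class Lines {
    private final int[] lineBreaks = {0, 10, 25};

    int[] breaksCopying() { return lineBreaks.clone(); } // allocates per call
    int[] breaksSharing() { return lineBreaks; }         // no allocation
}
```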
A two-fold reduction in end-to-end time by tuning 0.3GB/sec of allocations, so your "expectation #1: speed (measured as e.g. median latency on operations)" seems to fail for this particular checkstyle bug.
I have no idea if Zing (or using 50-80GB) would solve that problem
automagically, however I believe such requirements are pure overkill.
Checkstyle is used in lots of projects and you can't just install Zing
everywhere and supply it with a 50GB heap.
[2]: https://github.com/checkstyle/checkstyle/blob/master/src/main/java/com/puppycrawl/tools/checkstyle/api/FileText.java#L289
Vladimir
> Two-fold reduction of end-to-end time by tuning 0.3GB/sec allocations, so your "expectation #1: speed (measured as e.g. median latency on operations)" seems to fail for this particular checkstyle bug.

What you describe is more like "a significant improvement in my code that reduced runtime also resulted in a two-fold reduction in allocation." Your code change managed to save a lot of cycles, some of which happened to be spent in allocation, because some of your waste happened to be in making non-required duplicates of objects that didn't need duplication.