Choosing hardware to minimize latency


Michael Mattoss

unread,
Oct 23, 2014, 4:35:42 PM10/23/14
to mechanica...@googlegroups.com
Hi all,

I wrote an application based around the Disruptor that receives market events and responds accordingly.
I'm trying to decide which hardware configuration to purchase for the server hosting the app in order to minimize the latency between receiving an event and sending a response.
I should mention that the app receives < 1K messages/second so throughput is not much of an issue.

I tried to do some research but the amount of data/conflicting info is overwhelming so I was hoping some of the experts on this group could offer their insights.

How should I choose the right CPU type? Should I go with a Xeon E5/E7 for the large cache or should I favor a high speed CPU like the i7 4790K (4.4Ghz) since 99% of work is done in a single thread?
What about the new Haswell-E CPU's which seem to strike a good balance between cache size & core speed and also utilize DDR4 memory?
Does it matter if the memory configuration of a 16GB RAM for example is 4x4GB or 2x8GB?
Should I use an SSD or a high-performance (15K RPM) mechanical drive? (The app runs entirely in memory of course, and the BL thread is not I/O bound, but there's a substantial amount of data written sequentially to log files.) How about a combination of the two (SSD for the OS and a mechanical drive for log files)?
Is it worth investing in a high performance NIC such as those offered by Solarflare if OpenOnload (kernel bypass) is not used (just for the benefit of CPU offloading)?

Any help, suggestions and tips you may offer would be greatly appreciated.

Thank you!
Michael

Dan Eloff

unread,
Oct 23, 2014, 5:53:20 PM10/23/14
to mechanica...@googlegroups.com
A couple of things:

SolarFlare NICs are really good. CloudFlare tested a bunch of NICs and found them to be without rival: http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-generation-servers/

Haswell-E plus DDR4 is probably the way I'd go: the power savings should eventually pay for themselves, and the higher memory speeds will give a boost to most workloads. Given your single-threaded workload, pick something with a high clock speed. But keep in mind it's usually easy with the Disruptor to pipeline the workload and utilize more cores, e.g. receive, deserialize, preprocess, and postprocess requests on other threads.
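
To make the pipelining idea concrete, here is a minimal sketch of chaining handlers with the Disruptor DSL (the event type and handler bodies are illustrative, not from the original post; recent 3.x versions take a ThreadFactory, older ones an Executor):

    import com.lmax.disruptor.EventHandler;
    import com.lmax.disruptor.YieldingWaitStrategy;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;
    import java.util.concurrent.Executors;

    public class PipelineExample {
        // Hypothetical mutable event, pre-allocated and reused by the ring buffer.
        static final class MarketEvent { byte[] raw; Object decoded; }

        public static void main(String[] args) {
            Disruptor<MarketEvent> disruptor = new Disruptor<>(
                    MarketEvent::new, 1024,
                    Executors.defaultThreadFactory(),
                    ProducerType.SINGLE,
                    new YieldingWaitStrategy());

            EventHandler<MarketEvent> deserialize = (e, seq, end) -> { /* decode e.raw into e.decoded */ };
            EventHandler<MarketEvent> respond = (e, seq, end) -> { /* business logic, send response */ };

            // Deserialization runs on one thread, the response logic on another.
            disruptor.handleEventsWith(deserialize).then(respond);
            disruptor.start();
        }
    }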

SSDs are great, especially if you have to wait on fsyncing data to disk for reliability. However, know your write workload and keep write amplification in mind (small writes that are fsynced will write 128KB, or whatever the erase block size is). In typical workloads SSDs are just so much more reliable, and you really don't want to be dealing with a crashed hard drive at 4am if you can avoid it. The price/GB is good enough for most things these days.

Cheers,
Dan






Gil Tene

unread,
Oct 24, 2014, 1:57:04 AM10/24/14
to
[Some of this is cut/pasted from emails in which I had answered similar questions]

A few clarification points before answering the questions themselves:

You mentioned that throughput is not the main concern (<1K messages/sec expected), and that you want to minimize the latency between receiving an event and sending a response. But when you say "minimize the latency", do you mean:
a) you want to minimize the best case latency
b) you want to minimize the median latency
c) you want to minimize the worst experienced latency
d) some other level (e.g. minimize the 99.9%'lie)

The answers to some of your questions will vary depending on your answers. And the right answer is probably to have goals for at least three of the points above (b, c, and d at least).
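
To make those goals concrete, here is a minimal sketch of recording per-event response times and reporting the levels above (it assumes the HdrHistogram library is on the classpath; the bounds are illustrative):

    import org.HdrHistogram.Histogram;

    public class LatencyGoals {
        // Track values from 1 ns up to 10 seconds with 3 significant digits.
        private static final Histogram HISTOGRAM = new Histogram(10_000_000_000L, 3);

        public static void recordResponse(long startNanos, long endNanos) {
            HISTOGRAM.recordValue(endNanos - startNanos);
        }

        public static void report() {
            System.out.printf("median     = %d ns%n", HISTOGRAM.getValueAtPercentile(50.0));
            System.out.printf("99.9%%'ile  = %d ns%n", HISTOGRAM.getValueAtPercentile(99.9));
            System.out.printf("worst      = %d ns%n", HISTOGRAM.getMaxValue());
        }
    }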

If all you care about is the best case and the median latency, you are probably best off with a single socket, highest clock rate, small core count setup with HT turned off. So what if your latency jumps to 10-20msec every once in a while? You'll have great best-case and median numbers to show off, and you said you don't care about high percentiles.

However, if you care at all about your higher percentiles, you'll want to make sure to have enough cores to accommodate the peak number of runnable threads in your system, with some room to spare, and even then you may want to play some taskset, numactl, or isolcpus games. This is where higher core counts, dual socket setups, and turning HT on come into play: they all help keep the high percentiles and outliers down (in both frequency and magnitude).

As for NICs: my experience is that burning a core or two on OpenOnload or the like (spinning with no back-off) is worth a couple of usec in the common case, and that this is especially true for non-saturated workloads (like your <1K/sec), where each latency is measured from a standstill (responding to a single event). If what you care about is latency (and not power, space, or saving a few $), burning those cores is a cheap way to get what you want. Same goes for the Disruptor. Burning cores on controlled & dedicated spinning buys you better latency across the board, as long as you have plenty of cores to burn (it gets really bad when you don't have those cores).
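
For clarity, "burning a core on dedicated spinning" means something like the following sketch (names are illustrative; with the Disruptor this corresponds to picking a busy-spin wait strategy):

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    public class SpinningConsumer implements Runnable {
        private final Queue<Runnable> work = new ConcurrentLinkedQueue<>();
        private volatile boolean running = true;

        @Override
        public void run() {
            // Dedicated spinning: never park, never yield; trades a whole core for latency.
            while (running) {
                Runnable task = work.poll();
                if (task != null) {
                    task.run();
                }
            }
        }

        public void submit(Runnable task) { work.offer(task); }
        public void stop() { running = false; }
    }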

When it comes to comparing clock frequencies, I usually tell people to compare the max turbo boost frequencies, rather than the nominal ones. E.g. an E5-2697 V3 has 14 cores, 35MB of L3, and shows 2.6GHz nominal. An E5-2687W V3 has 10 cores, 25MB of L3, and shows 3.1Ghz nominal. And an E5-2667 V3 has 8 cores, 20MB of L3, and shows 3.2Ghz nominal. So you may think the 3.2GHz number is best. But if you look carefully, all three chips have the same Turbo Boost frequency (3.6Ghz), and chances are that with the same number of cores actually busy, they'll all be running at the same frequency. But the larger chip (E5-2697) has more cores, which can help with outliers and allow for more generous "burning" of cores. And it has a significantly bigger L3 which can help with everything.

When you are latency sensitive, your best hint as an initial filter is power envelope: you should be looking at and choosing between chips in the higher power envelope rankings (135-160W usually). The lower power chips will keep your frequencies lower. [Also note that desktop chips tend to be classified by their turbo boost, as in "up to 4.4GHz", while server chips focus on nominal frequency (2.6GHz rather than "up to 3.6GHz"); that's probably marketing driven.]

For the common single socket vs. dual socket question, and Haswell-E vs. E5, I usually lean towards the E5s. Yes, something like the i7-4790K has a higher clock rate: 4.0GHz nominal, boosting to 4.4GHz, but it only has 4 physical cores, and with the Disruptor and a good NIC stack you'll be burning those cores down fast, and may quickly have 10-20msec outliers to deal with, causing you to prematurely give up on spinning and start paying in common case latency for this lack of spinning cores. Another E5 benefit is memory capacity and bandwidth. i7s tend to peak at 32GB and have only 2 memory channels, while E5s have 4 channels, support faster DRAMs (e.g. 2133 vs. 1600), and take a lot more memory (a typical config of 128GB per socket is commonplace in 2 socket servers today; e.g. that's what the latest EC2 machines seem to be built with).

As for 1 vs. 2 sockets: most 1 vs. 2 socket situations where the 1 socket system "wins" can be matched with numactl and locking the workload (including its RAM) down to one of the two sockets. 2 socket setups have the benefit of allowing you to separate background stuff (like the OS and rarely active processes or threads, and potentially your own non-critical threads) from your critical workload. They do require more care, e.g. with things like IRQ assignment and making sure your packets are processed (in the network stack) in the same L3 that your consuming threads will be on, but in the end that extra capacity can be a lifesaver.
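
If you want to do the equivalent pinning from inside the Java process, a sketch along these lines is one common approach (this assumes the open-source OpenHFT Java-Thread-Affinity library; the library choice and thread name are illustrative):

    import net.openhft.affinity.AffinityLock;

    public class PinnedConsumer {
        public static void main(String[] args) {
            Thread consumer = new Thread(() -> {
                // Reserve a whole core (both hyperthreads) for this thread.
                AffinityLock lock = AffinityLock.acquireCore();
                try {
                    runBusinessLogicLoop();
                } finally {
                    lock.release();
                }
            }, "critical-consumer");
            consumer.start();
        }

        private static void runBusinessLogicLoop() {
            // spin, poll the ring buffer, respond to events, etc.
        }
    }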

Bottom line:

When people ask me about a "good" system for low latency (and this happens a lot), I answer that if $$ or power are not the drivers, and latency is, my current recommendation is to go with a 2 socket E5 (with something like an E5-2697 V3), and 128GB-256GB of memory. The street price for that is somewhere between $8K and $12K ($3K-$4K of which is spent when focusing on the faster, higher core count Xeons). I also recommend investing in a good 10G NIC (like Solarflare), which will add some to that price but buy you a few more usec.

Martin Thompson

unread,
Oct 24, 2014, 2:53:35 AM10/24/14
to mechanica...@googlegroups.com
On 24 October 2014 06:57, Gil Tene <g...@azulsystems.com> wrote:

+1 A lot of good advice here.

I'd add to the point of getting a dual socket E5. You want to avoid L3 cache pollution from the OS and other processes on your machine. You can do this by using isocups on the second socket and running your app there. The Intel L3 cache is inclusive of L1 and L2, so another core can cause your data/code to be evicted from the L3 and thus your private L1/L2, even when threads are pinned to cores. It's the classic cold start problem after a quiet period.

You need to strike a balance between having sufficient cores to *always* be able to run a thread when needed and the fact that larger L3 caches tend to have higher latencies; this also needs to consider your working set size in L3. Be careful: a microbenchmark might suggest a smaller L3 is better because the working set is small for the microbenchmark, but that can change radically in a real system requiring a larger L3 working set.

I can vouch for the Solarflare network cards too but also consider Mellanox. In some benchmarks I've seen them be a touch faster again.

If money is no object then consider immersed liquid cooling and overclocking with Turbo always turned up.

Jean-Philippe BEMPEL

unread,
Oct 24, 2014, 3:52:25 AM10/24/14
to mechanica...@googlegroups.com
+1 with Gil. We are also in the low-latency space with low throughput, and this is exactly what we are doing.



Michael Mattoss

unread,
Oct 24, 2014, 10:44:53 AM10/24/14
to mechanica...@googlegroups.com
Thank you all for your replies and insights!
I would also like to take this opportunity to thank Martin and Gil for developing, and more importantly sharing, the Disruptor and HDRHistogram libraries!

Regarding the latency requirements, I'm mostly concerned with minimizing the 99.9%'lie latency.
Unfortunately, due to budget constraints (somewhere around $5K), a dual socket server doesn't look like a viable option. I'm looking into Xeon E5 v3 and Haswell-E processors and have narrowed down the list of possible candidates, but I need to do more research.

I have a few more questions:
1. What are pros & cons of disabling HT (other than the obvious reduction of logical cores)?
2. Does it make sense to enable HT to increase the total number of available logical cores, but to isolate some of the physical cores and assign them only 1 thread, so that thread does not share the physical core with any other threads?
3. Is there a rule of thumb on how to balance cache size against core speed?

Thank you,
Michael

Gil Tene

unread,
Oct 24, 2014, 11:31:38 AM10/24/14
to mechanica...@googlegroups.com
Budgeting tip: 

Always ask yourself this question: Is saving $5K to $10K in hardware budget worth 1 month of time to market and 1 month of engineering time?

- If the answer is yes, your application isn't worth much. I.e. [by definition] its value is less than $5K to $10K per month, minus the cost of 1 month of engineering. And your engineers aren't being paid very well.

- If the answer is "That extra $5K in hardware won't buy me that engineering time and time-to-market" (which is very reasonable for many apps), then what that answer really means is that your application probably doesn't care about the 10-20% difference in the lower latency paths that the hardware will cheaply get you, and you can live fine with a $5K two socket server (e.g. E5-2640 V3) with 128GB of memory (that's the true commodity point for servers that are not extremely speed or latency sensitive these days).

If you are going to spend the time to get the low latency parts right, the engineering-hours budget involved in making latency behave well will far outrun any amount of $$ you spend on better commodity hardware. That's true unless you intend to deploy on several hundreds of these machines, and even then I'd bet the engineering effort would cost more than the hardware. You should expect to spend more on engineering than on hardware regardless. But when dealing with low latency, spending an extra $5K-$10K on your hardware to save yourself the time, cost, pain, and duct-tape involved in engineering your workload to fit into a low budget box ALWAYS makes sense.

Jean-Philippe BEMPEL

unread,
Oct 24, 2014, 4:37:23 PM10/24/14
to mechanica...@googlegroups.com
I do not know what your numbers for low latency are or how sensitive you are to them, but for me, at around 100us, it is _vital_ to have 2 sockets: if I have only one socket my latencies literally _double_ (to 200us) because of the L3 cache pollution due to administrative threads and other Linux processes...
99.9% is also pretty aggressive, especially if you have low throughput.

just my 2 cents.


Martin Thompson

unread,
Oct 24, 2014, 5:51:56 PM10/24/14
to mechanica...@googlegroups.com
+1


Gil Tene

unread,
Oct 24, 2014, 7:18:00 PM10/24/14
to mechanica...@googlegroups.com
I couldn't resist. On my browser, Google groups shows this for Martin's last message:
 

+1

To unsubscribe from this group and all its topics, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


Steve Morin

unread,
Oct 25, 2014, 11:40:53 AM10/25/14
to mechanica...@googlegroups.com
What's isocups? Saw your reference to it but google didn't come up with much.

Georges Gomes

unread,
Oct 25, 2014, 11:50:34 AM10/25/14
to mechanica...@googlegroups.com
Isolcpus.
Isolation of CPUs.
Google should answer that better :)

On Sat, Oct 25, 2014, 17:40 Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.

Martin Thompson

unread,
Oct 25, 2014, 11:50:50 AM10/25/14
to mechanica...@googlegroups.com

On 25 October 2014 16:40, Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.

Karlis Zigurs

unread,
Oct 25, 2014, 11:54:49 AM10/25/14
to mechanica...@googlegroups.com
Don't forget to use taskset (or numactl) to launch the process on the 'freed' cores/socket afterwards as well.
Also it may be useful to tune garbage collector threads so that they match the dedicated core count (any real world experiences around this would be interesting to hear).

K


Michael Mattoss

unread,
Oct 25, 2014, 2:24:29 PM10/25/14
to mechanica...@googlegroups.com
Thanks Gil, you raise some good points. I will have to reconsider the hardware budget.
Could you please answer the questions in my previous post regarding enabling/disabling HT and how one weighs cache size against core speed?

Thank you,
Michael

Michael Mattoss

unread,
Oct 25, 2014, 2:39:28 PM10/25/14
to mechanica...@googlegroups.com, jean-p...@bempel.fr
These numbers seem a bit high. At what percentile do you get to 100us latency?
Regarding the 99.9%'lie, I actually think quite the opposite: the fewer messages you have (such as in a low throughput environment), the greater the impact of responding late to a message.

Michael Mattoss

unread,
Oct 25, 2014, 2:43:41 PM10/25/14
to mechanica...@googlegroups.com
Out of curiosity, is there an equivalent to isolcpus in the Windows world?



Jean-Philippe BEMPEL

unread,
Oct 25, 2014, 3:54:29 PM10/25/14
to mechanica...@googlegroups.com
Michael, 

For HT, the thing is: if the 2 threads share data in L1 and L2 you are fine; otherwise the threads pollute each other, pulling lines from L3.


Jean-Philippe BEMPEL

unread,
Oct 25, 2014, 4:08:03 PM10/25/14
to mechanica...@googlegroups.com
Consider that we get 100us at the 50th percentile.

The thing is, at a low rate the caches are not very warm. This is why thread affinity and isolcpus are mandatory.
If you increase throughput you generally observe better latencies.

For percentiles it is a question of math: it depends on the number of measurements. For me, at 99.9%, GC impacts this percentile.

For us, at 99%, since we have a low allocation rate, minor GCs are not impacting our measurements (this is also because it includes coordinated omission).



Michael Mattoss

unread,
Oct 25, 2014, 6:03:45 PM10/25/14
to mechanica...@googlegroups.com, jean-p...@bempel.fr
Perhaps I'm missing something here, but if you warm up the cache before the system switches to the steady-state phase and you use thread affinity and isolcpus, why would the cache get cold, even when the message rate is low? Does the CPU evict cache lines based on time?
As for GCs, I'm happy to say that I don't need to worry about them. I put a lot of work into designing & implementing my system so that it would not allocate any memory during the steady-state phase. Instead, it allocates all the memory it will need upfront (during the initialization phase) and afterwards it just recycles objects through pools.
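
For reference, the pre-allocate-and-recycle pattern can be as simple as this sketch (single-threaded; the names are illustrative, not my actual code):

    import java.util.ArrayDeque;
    import java.util.function.Supplier;

    // Minimal single-threaded object pool: everything is allocated up front,
    // nothing is allocated on the steady-state path.
    final class ObjectPool<T> {
        private final ArrayDeque<T> free;

        ObjectPool(int size, Supplier<T> factory) {
            free = new ArrayDeque<>(size);
            for (int i = 0; i < size; i++) {
                free.push(factory.get());
            }
        }

        T acquire() {
            T obj = free.poll();
            if (obj == null) {
                throw new IllegalStateException("pool exhausted"); // fail fast rather than allocate
            }
            return obj;
        }

        void release(T obj) {
            free.push(obj);
        }
    }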

Martin Thompson

unread,
Oct 25, 2014, 7:12:34 PM10/25/14
to mechanica...@googlegroups.com
On 25 October 2014 23:03, Michael Mattoss <michael...@gmail.com> wrote:
Perhaps I'm missing something here, but if you warm up the cache before the system switches to steady-state phase and you use thread affinity and isolcpus, why would the cache gets cold, even when the message rate is low? Does the CPU evict cache lines based on time?

Intel runs an inclusive cache policy whereby the L3 cache contains all the lines in the private L1s and L2s for the same socket. Therefore if anything runs on that socket then the L3 cache will be impacted. Isolcpus and pinning do help but are not perfect; ssh will still run on all cores to gather entropy even when cores have been isolated. You also need to be aware that higher power saving states can evict various buffers and caches, e.g. the branch prediction state, the L0 instruction cache, and the decode buffers, just for the core alone, never mind the L1 and L2 caches.

If you are running a JVM then there are a number of things to watch out for like RMI doing a system GC to support distributed GC whether you need it or not!
 

Gil Tene

unread,
Oct 27, 2014, 1:01:17 AM10/27/14
to mechanica...@googlegroups.com
1. What are pros & cons of disabling HT (other than the obvious reduction of logical cores)?

The con is the loss of logical cores. That one is huge (extremely non-linear), which is why I always recommend people start with HT on, and only try turning it off late in the game, and to very carefully compare the pre/post situation with regard to outliers (i.e. compare multi-hour runs, not those silly 5 minute u-bench things). A "rare" 20msec spike in runnable threads that happens only once per hour can do a lot of damage, and HT on/off *could* make the difference between that spike killing you with outliers and not.

The pro varies a lot. A non-HT core has various "betterness" levels that depend on the core technology (e.g. Haswell vs. Westmere) and on your workload. For resources that are "dynamically partitioned" between HTs on a core, an idle HT has literally no effect. But there are resources that are statically partitioned when HT is on, and halving those can result in various slowdown effects. E.g. halving the first level TLB (as was the case on some earlier cores) will affect you, but the effect will depend on (4K and 2M page level) locality of access. Branch prediction resources, reservation slots, etc. are also statically partitioned in some cores, and similarly could have an effect that varies from 0 to a lot...

Best thing to do is measure. Measure the effect on *your* common case latency, on *your* outliers, and on *your* throughput. [And report back on what you find]  
 
2. Does it make sense to enable HT to increase the total number of available cores but to isolate some of the physical cores and assigning them only 1 thread so that thread does share the physical core with any other threads?

You can certainly do that, but only if you were already planning to use isolcpus. When you use isolcpus and static thread-to-cpu assignment, keeping neighbors away from specific threads by "burning" 2 HTs for the one thread can be useful. But in a low load system (like yours), the scheduler will tend to keep stuff spread across the physical cores anyway (again, the benefit of having chips with plenty of cores).
 
3. Is there a rule of thumb on how to balance cache size against core speed?

My rule of thumb is that cache is always better than core speed if you have non-i/o related cache misses. This obviously reverses at some point (At an L3 miss rate of 0.00001% I may go with core speed).

The harder tradeoff is cache size vs. L3 latency. Because the L3 sits in a ring in modern Xeons, the larger the L3 is, the more hops there are in the ring, and the higher the L3 latency can get to some (random) addresses. It gets more complicated, too. E.g. on the newer Haswell chips, going above a certain number of cores may step you into a longer-latency huge L3 ring, or may force you to partition the chip in two (with 2 shorter latency rings with limited bandwidth and higher latency when crossing rings).

But honestly, detail levels like L3 latency variance between chip sizes are so *way* down in the noise at the point where most people start studying their latency behavior. You may get to actually caring about it eventually, but you probably have 20 bigger fish to fry first.

Martin Thompson

unread,
Oct 27, 2014, 4:06:43 AM10/27/14
to mechanica...@googlegroups.com

On 27 October 2014 05:01, Gil Tene <g...@azulsystems.com> wrote:

3. Is there a rule of thumb on how to balance cache size against core speed?

My rule of thumb is that cache is always better than core speed if you have non-i/o related cache misses. This obviously reverses at some point (At an L3 miss rate of 0.00001% I may go with core speed).

The harder tradeoff is cache size vs. L3 latency. Because the L3 sits in a ring in modern Xeons, the larger the L3 is, the more hops there are in the ring, and the higher the L3 latency can get to some (random) addresses. It gets more complicated, too. E.g. on the newer Haswell chips, going above a certain number of cores may step you into a longer-latency huge L3 ring, or may force you to partition the chip in two (with 2 shorter latency rings with limited bandwidth and higher latency when crossing rings).

But honestly, detail levels like L3 latency variance between chip sizes are so *way* down in the noise at the point where most people start studying their latency behavior. You may get to actually caring about it eventually, but you probably have 20 bigger fish to fry first.

I've been spending a good chunk of this year building a low-latency system and lately have been tracking the major causes of latency during the profiling phase. The top three causes of latency, if you are working in the 5-50us range, are:

1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!

2. Resource Contention: You always need to have more cores available than the number of threads that need to run. HT helps somewhat but is really a poor person's alternative to having sufficient real cores. You often need more cores than you think.

3. Wait-Free vs Lock-Free algorithms: When a controlling thread takes an interrupt, other threads involved in the same algorithm cannot make progress. Reworking core algorithms to be wait-free, in addition to lock-free, made a much bigger improvement to the long tail than I expected, even in the case of not having applied isolcpus or thread pinning. We are explicitly designing our system so that it behaves as well as possible on a vanilla distribution, as well as when pinning and other tricks are employed.

These sorts of things matter more than the size of your L3 cache. The difference in latencies between different L3 caches sizes can easily be traded off by just slightly reducing the amount of pointer chasing you do in your code.
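
To illustrate the wait-free vs lock-free distinction in point 3, compare a CAS retry loop with a single atomic increment (a toy sequence counter, not code from the system I'm describing; the increment is wait-free on platforms where it compiles to an atomic fetch-and-add, e.g. LOCK XADD on modern x86 JVMs):

    import java.util.concurrent.atomic.AtomicLong;

    public class SequenceCounters {
        private final AtomicLong sequence = new AtomicLong();

        // Lock-free: some thread always makes progress, but this particular thread
        // may retry indefinitely if it keeps losing the CAS race.
        long nextLockFree() {
            long current;
            do {
                current = sequence.get();
            } while (!sequence.compareAndSet(current, current + 1));
            return current + 1;
        }

        // Wait-free (on the platforms noted above): every thread completes
        // in a bounded number of steps regardless of contention.
        long nextWaitFree() {
            return sequence.incrementAndGet();
        }
    }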

Michael Mattoss

unread,
Oct 27, 2014, 5:47:43 PM10/27/14
to mechanica...@googlegroups.com
I would like to thank everyone for sharing their knowledge and insights. Much appreciated!
Hopefully, I'll be able to share some data in a few weeks.

Thanks again,
Michael

Tom Lee

unread,
Oct 28, 2014, 12:52:54 AM10/28/14
to mechanica...@googlegroups.com
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!

Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations (at least on the JVM -- can't speak to native code, but I can imagine that malloc/free could hurt just as much as GC there). Considered pooling briefly in the past, but our working set is often large enough that it basically becomes a choice between potentially exhausting pools or cluttering old gen. Even without pooling, careful reuse of objects between iterations can be fraught with risk as any other allocations on the hot path can quietly shuttle any number of objects over to old gen. It's tough.

Strings are also a particularly painful topic as far as the JVM is concerned: often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

Cheers,
Tom





--
Tom Lee http://tomlee.co / @tglee

Richard Warburton

unread,
Oct 28, 2014, 4:50:36 AM10/28/14
to mechanica...@googlegroups.com
Hi,

Strings are also a particularly painful topic as far as the JVM is concerned: often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

I've proposed a patch to String on core-libs which lets you more cheaply encode and decode them from/to ByteBuffer and subsequences of byte[]. It's not accepted yet, but the response seemed to be positive. So hopefully Java 9 will be a bit better in this regard. It would be nice to add similar functionality to StringBuffer/StringBuilder as well.

It would be nice to be able to implement a CharSequence that wraps some underlying source of bytes, but the problem with this kind of approach is that a lot of APIs take String over CharSequence. So it's not that you just end up reimplementing String; you also end up reimplementing a lot of other stuff.

The question always arises in my head: "Why are you using Strings?" If it's because you want a human-readable data format, then using a binary encoding that comes with a program which lets you pretty-print the encoding is just as good IMO and avoids a lot of these issues. If you're interacting with an external protocol which is text based, I can appreciate that this isn't a decision you can make so easily.

Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

One of the specific on-heap NIO related allocation hot-spots seems to be in Selectors, which have to update an internal HashSet inside select(). If you're processing a lot of selects then this can be problematic.

regards,

  Richard Warburton

Martin Thompson

unread,
Oct 28, 2014, 5:38:32 AM10/28/14
to mechanica...@googlegroups.com
On 28 October 2014 04:52, Tom Lee <m...@tomlee.co> wrote:
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!

Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations (at least on the JVM -- can't speak to native code, but I can imagine that malloc/free could hurt just as much as GC there). Considered pooling briefly in the past, but our working set is often large enough that it basically becomes a choice between potentially exhausting pools or cluttering old gen. Even without pooling, careful reuse of objects between iterations can be fraught with risk as any other allocations on the hot path can quietly shuttle any number of objects over to old gen. It's tough.

The summary is that if you are working in the 10s of microsecond space then any allocation will result in significant outliers regardless of JVM. Anyone in HFT is usually well aware of this issue.

However the picture gets more interesting for general purpose applications. I get to profile quite a few applications in the field across a number of domains. The thing this has taught me is that the biggest performance improvements often come from simply reducing the allocation rate. A little bit of allocation profiling and some minor code changes can give big returns. I recommend to all my customers that they run a profiler regularly and keep allocation to modest levels, regardless of application type, and the returns are significant for minimal effort. Our current garbage collectors seem to be not up to the job of coping with the new multicore world and large memory servers - Zing excluded :-)

Allocation is just as big an issue in the C/C++ world. In all worlds it is the reclamation rather than the allocation that is the issue. Just try allocating memory on one thread and freeing it on another in a native language and see how that performs. The big benefit we get in the native world is stack allocation. This has so many benefits besides cheap allocation and reclamation: it is also local in the hot OS page and does not suffer false sharing issues. In the multicore world I so miss stack allocation when using Java. It does not have to be a language feature; Escape Analysis could be better, or an alternative like the object explosion in JRockit could help.

Pools can be a useful technique at times. They are especially a big win when handing things back and forth between two threads. However pooling should be used with extreme caution and guided by measurement. Objects in pools need to be considered immortal.
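
One cheap way to sanity-check that a hot path really stays allocation-free is a sketch like the following (it uses the HotSpot-specific com.sun.management.ThreadMXBean, so treat it as illustrative rather than portable):

    import java.lang.management.ManagementFactory;

    public class AllocationCheck {
        // HotSpot exposes per-thread allocation counters via com.sun.management.ThreadMXBean.
        private static final com.sun.management.ThreadMXBean THREADS =
                (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

        public static long allocatedBytes(Runnable task) {
            long tid = Thread.currentThread().getId();
            long before = THREADS.getThreadAllocatedBytes(tid);
            task.run();
            return THREADS.getThreadAllocatedBytes(tid) - before;
        }

        public static void main(String[] args) {
            long bytes = allocatedBytes(() -> {
                String s = "event-" + System.nanoTime(); // an example of a hidden allocation
            });
            System.out.println("allocated on hot path: " + bytes + " bytes");
        }
    }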
 
Strings are also a particularly painful topic as far as the JVM is concerned: often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

Not as strange as you would think. Besides the proliferation of XML and JSON encodings, we have to deal with text even on some of the highest performance systems. Most financial exchanges still use the tag-value encoded FIX protocol. Thankfully many exchanges are now moving to binary encodings, but it will take time for the majority to migrate.

Codecs burn more CPU cycles than anything else in most applications. We need better examples for people to copy and the right primitives for building parsers, especially in Java. We need simple things like being able to go between ByteBuffer and String without intermediate copies that go to and from byte[]s, e.g. we need a constructor something like String(ByteBuffer bb, Charset cs) and a String method int getBytes(ByteBuffer bb, Charset cs).

We also need to be able to deal with text-encoded numbers, dates, times, etc. in ByteBuffer and byte[] without allocation and copying. Things like writeInt(int value) as ASCII or UTF-8 to and from ByteBuffer or byte[]. This way the bounds checking can be done once, fewer reference indirections are taken per operation, and allocation is totally avoided.
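
A rough sketch of the sort of primitive I mean, an allocation-free ASCII integer writer over a ByteBuffer (illustrative only, not a proposed JDK API; it ignores Integer.MIN_VALUE for brevity):

    import java.nio.ByteBuffer;

    public final class AsciiEncoding {
        private AsciiEncoding() {}

        // Writes the decimal ASCII representation of value at the buffer's current
        // position, without creating a String or any other intermediate object.
        public static void putIntAscii(ByteBuffer buffer, int value) {
            if (value == 0) {
                buffer.put((byte) '0');
                return;
            }
            if (value < 0) {
                buffer.put((byte) '-');
                value = -value; // Integer.MIN_VALUE is not handled in this sketch
            }
            int start = buffer.position();
            while (value > 0) {
                buffer.put((byte) ('0' + (value % 10)));
                value /= 10;
            }
            // Digits were written lowest-first; reverse them in place.
            for (int i = start, j = buffer.position() - 1; i < j; i++, j--) {
                byte tmp = buffer.get(i);
                buffer.put(i, buffer.get(j));
                buffer.put(j, tmp);
            }
        }
    }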

When I build parsers I often write my own String handling classes and design APIs so they are composable. That is, have the lower level APIs trade complexity for efficiency and allow them to be wrapped/composed with APIs that are idiomatic or easier to use. This way the end user has a choice, rather than being treated like a child who cannot make choices. On a quick scan of the proposed JSON API for Java 9, none of the important lessons seem to have been learned.
 
Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

I'd be here all day if you got me started on NIO :-(
 
Just take the simple example of Selector. You do a selectNow() and then you get a selected key set that you must iterate over. Why not just pass a callback to selectNow(), or pass in a collection to fill? This and the likes of String.split() are examples of brain dead API design that causes performance issues. Richard and I have worked on the likes of this, and he has at least listened to my whines and is trying to do something about it.
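
For anyone not familiar with the pattern being criticized, the idiomatic selectNow() loop looks roughly like this; note the per-call Iterator and the mandatory mutation of the returned set:

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.Iterator;

    public class SelectorLoop {
        // The standard (and allocation-prone) way to drain ready keys.
        static void poll(Selector selector) throws IOException {
            if (selector.selectNow() == 0) {
                return;
            }
            Iterator<SelectionKey> it = selector.selectedKeys().iterator(); // allocates an Iterator
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove(); // the caller must clear the set itself
                handle(key);
            }
        }

        static void handle(SelectionKey key) {
            // dispatch on key.readyOps(): accept, read, write, ...
        }
    }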

Vitaly Davidovich

unread,
Oct 28, 2014, 1:16:15 PM10/28/14
to mechanica...@googlegroups.com

+1.  It doesn't help that the mainstream java community continues to endorse and promote the "allocations are cheap" myth.

Sent from my phone


Rajiv Kurian

unread,
Oct 28, 2014, 1:30:32 PM10/28/14
to mechanica...@googlegroups.com


On Tuesday, October 28, 2014 2:38:32 AM UTC-7, Martin Thompson wrote:

Of course, in the C/C++ world writing custom allocators is one of the first things to do in a high performance project. Culturally speaking it is more acceptable too. If you need to allocate memory on one thread and free it on another, you usually allocate the memory, pass a const view into that buffer via a queue/ring-buffer to the other thread, and when you want to free it you pass a message back to the owning thread saying you are done, so the owning thread can put it back. Simple conventions like using a circular buffer to directly emplace objects (you have to come up with some header scheme if you want different kinds of objects) work really well, especially with regard to fragmentation. On the receiving thread you can look at the header, cast to a struct, and process it.

Again, having first class structs makes this kind of code much more manageable compared to writing error prone flyweights in Java over ByteBuffers or unsafe blobs. Of course, the processing logic becomes a big switch statement over object types, which might not be the best for icache performance. In projects with very few types of objects one could have a circular buffer for each object type, which leads to better icache utilization. With languages like Rust with lifetime analysis you could go one step further and make sure that pointers to these structs (with lifetime equal to a stack allocated variable) are not leaked into heap allocated memory, which could lead to a world of trouble (in Java or C/C++).

I think for anything high performance, writing a specific allocator is a big win, and C/C++ have a huge win here because stack allocation/automatic variables happen to be one such custom allocator that is highly usable in most projects. Any generic malloc/free implementation, even with thread local caches and cross thread malloc/free, has very few guarantees about the amount of fragmentation, the nature of fragmentation, etc. The JVM might have a great allocator, but high performance projects cannot count on any generic allocator, even one as advanced as the JVM's. I am really excited about Rust because it encourages people to think about object lifetimes, layout, and their relationship over the course of a program. GCed languages remove this concern and with it the ability to reason about memory layout.
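
A rough Java analogue of the "hand it back to the owning thread" scheme described above (a sketch only; a real implementation would use a bounded, cache-friendly queue rather than ConcurrentLinkedQueue, and the names are illustrative):

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;

    // The owner thread allocates buffers up front; consumer threads never free them,
    // they just hand them back via a queue that only the owner drains.
    final class OwnedBufferPool {
        private final Queue<byte[]> free = new ArrayDeque<>();
        private final Queue<byte[]> returned = new ConcurrentLinkedQueue<>();

        OwnedBufferPool(int count, int bufferSize) {
            for (int i = 0; i < count; i++) {
                free.add(new byte[bufferSize]);
            }
        }

        // Called only on the owner thread.
        byte[] acquire() {
            reclaimReturned();
            byte[] buf = free.poll();
            if (buf == null) {
                throw new IllegalStateException("pool exhausted");
            }
            return buf;
        }

        // Safe to call from any consumer thread once it is done with the buffer.
        void handBack(byte[] buf) {
            returned.offer(buf);
        }

        private void reclaimReturned() {
            byte[] buf;
            while ((buf = returned.poll()) != null) {
                free.add(buf);
            }
        }
    }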

Todd Montgomery

unread,
Oct 28, 2014, 1:51:16 PM10/28/14
to mechanica...@googlegroups.com
BTW, I'll be talking at QCon SF next week about the lessons that Martin, Richard, and I have had with Java 8 with the project that Martin has mentioned.

The primary allocation problem with Selector is, as Martin mentioned, the selectedKeySet that is in use internally. It is possible to avoid the allocations within the HashSet by using your own set. This is the approach that Netty has taken, for example. But the API is also close... but not quite. Simple changes would make NIO so much better. There are also some DatagramChannel specific annoyances to deal with. The (IMHO) unneeded locking built into channels and Selector is quite disappointing for someone used to the native networking world to consider an acceptable design. Which leads to optimizations like finding ways to NOT use Selector unless the number of channels reaches a certain threshold, etc.

For C/C++, there are some additional things to consider. Freeing on a different thread than the one that allocated is a concern that can be handled via design marginally well. Normal allocators often work very poorly with some data structures due to fragmentation. The good thing is that most mechanically sympathetic data structures avoid that naturally. As Martin cautions, pooling can be quite useful or a total nightmare when done badly. I must say that as a general rule for C/C++, I tend to: (1) use the stack for copy-on-write/copy-on-save, and (2) use single threaded designs with ARC. I consider usage of std::shared_ptr to often be a smell.

-- Todd

Eric Wong

unread,
Oct 28, 2014, 2:28:26 PM10/28/14
to mechanica...@googlegroups.com
Martin Thompson <mjp...@gmail.com> wrote:
> Allocation is just as big an issue in the C/C++ world. In all worlds it is
> the reclamation rather than the allocation that is the issue. Just try
> allocating memory on one thread and freeing it on another in a native
> language and see how that performs. The big benefit we get in the native

Depends on the malloc implementation, of course. The GPLv3
locklessinc.com malloc handles it fine (as does my experimental
dlmalloc+wfcqueue fork[1])

I think the propagation of things like Userspace-RCU (where cross-thread malloc/free
is a common pattern) will encourage malloc implementations to treat this
pattern better.

[1] git clone git://80x24.org/femalloc || http://femalloc.80x24.org/README

Martin Thompson

unread,
Oct 28, 2014, 2:56:59 PM10/28/14
to mechanica...@googlegroups.com
Does this implementation need to use a CAS to free the memory when returning it? If not, how does it avoid it? 

Eric Wong

unread,
Oct 28, 2014, 3:12:34 PM10/28/14
to mechanica...@googlegroups.com
No CAS. In femalloc, free via wfcqueue append only uses xchg.

git clone git://git.lttng.org/userspace-rcu.git
See ___cds_wfcq_append in urcu/static/wfcqueue.h

The wait-free free cost is shifted to allocation when it reads from the
queue, which may block if the free-ing thread is in the middle of an
append (I think the blocking is unlikely, but I haven't tested
this enough...)

locklessinc uses xchg in a loop to implement hazard pointers
and a lock-free (but not wait-free) stack. It can loop inside the
free-ing thread, but not the allocating thread (I'm less familiar
with this one).

Rüdiger Möller

unread,
Oct 28, 2014, 3:15:40 PM10/28/14
to mechanica...@googlegroups.com

On Tuesday, 28 October 2014 10:38:32 UTC+1, Martin Thompson wrote:
Just take the simple example of Selector. You do a selectNow() and then you get a selected key set that you must iterate over. Why not just have selectNow() take a callback, or pass in a collection to fill? This and the likes of String.split() are examples of brain-dead API design that causes performance issues. Richard and I have worked on the likes of this, and he has at least listened to my whines and is trying to do something about it.

:)) I feel your pain .. 90% of Java APIs/libraries look that way. It's "great" when JEE standard APIs enforce object creation by design, such that the quality of the concrete implementation doesn't make a difference anymore.

Rüdiger Möller

unread,
Oct 28, 2014, 3:17:34 PM10/28/14
to mechanica...@googlegroups.com


On Tuesday, 28 October 2014 09:50:36 UTC+1, Richard Warburton wrote:

The question always arises in my head: "Why are you using Strings?" If it's because you want a human-readable data format, then using a binary encoding that comes with a tool to pretty-print it is just as good IMO and avoids a lot of these issues. If you're interacting with an external protocol which is text based, I can appreciate that this isn't a decision you can make so easily.


Yep, the internet is running in debug mode. Part of the problem is mixing behaviour and encoding when defining a protocol. That's common, unfortunately.


Martin Thompson

unread,
Oct 28, 2014, 3:18:37 PM10/28/14
to mechanica...@googlegroups.com
OK, I understand. This is as fast as things can get when contended. However, it is still fundamentally contended: the XCHG, while not a CAS, is still a fully fenced instruction that must wait on the store buffer to drain before making progress. You don't want to do this too often in the middle of a high-performance algorithm.
 

Gil Tene

unread,
Oct 28, 2014, 3:52:50 PM10/28/14
to mechanica...@googlegroups.com
<< Putting on my contrarian "but allocations ARE cheap" hat. >>

I disagree, guys. The notion that "allocation causes outliers" is bogus. Pauses and glitches are not an inherent side-effect of allocation. They are side effects of bad allocator implementations.

The notions that "allocation slows things down" are also misleading. I.e. pooling is not an alternative to allocation., Pooling *IS* allocation, in a different, slower, but in a maybe-better-in-outliers implementation. Pooling is usually *slower* (not faster) than straight-line allocation.

I'm in no way advocating allocation-for-no-good-reason. From a common-case speed perspective, efficient code that can completely avoid bringing fresh lines into the cache and pushing stale ones out will obviously tend to be faster (in the common case) than code that does move many such cache lines in and out. I'm not arguing otherwise, but that's ALL there is to the fundamental cost of allocation.

And that cost is often well hidden, and can approach zero. E.g. a good streaming allocator (as in 2MB TLABs) has virtually perfect access pattern predictability, so the incoming latency cost does not really exist (assuming the allocation pipeline works right).

And it's true that even such efficient allocation patterns will still evict other (cold) lines from the various cache hierarchies (L1, L2, L3, and other cores' L2 and L1 in turn, due to inclusiveness). But it's mostly cold lines that are susceptible to being pushed out of the L3 (and neighboring L2/L1). Hot lines that are part of your active working set will stay in the cache, due to the cache's LRU and LRU-approximating policies.

So where does that big bugaboo come from? That thing that makes allocation behave badly, and that causes smart people with real experience to work so hard to avoid it? 

The answer can often be found in a forced choice of runtime and platform, one that has terrible allocation side-effects. The hint is in statements like "...we wanted to make it work well on vanilla <runtime> implementations". The real cost and issue with allocation is in how it is handled in those vanilla implementations. It's simple: if you use a crap allocator, you'll get crap allocation-related performance artifacts. But conversely, if you use an allocator that supports the qualities you want (e.g. minimize allocation-related outlier artifacts and side-effects), that doesn't have to be the case. 

If the code you are building is *not* intended to be used by the world-as-a-whole on generic vanilla runtime, and you are trying to get stuff done and out to market quickly, using a good allocator (instead of engineering all allocations away) can save you tons of engineering time, efforts, money and headaches. The easiest place this pays off is in not having to write your own zero-allocation code for everything, and in not having to force your developers to only use "code-written-in-this-building". Leverage == money, time, etc. etc. etc.

With all that said, it's true that in Java, even with the best collector on the market (guess which one I'm talking about? Yup, the one that's >1000x better than all others in outlier behavior...), allocation will eventually lead to outlier glitches that range into the hundreds of micro-seconds (yes, that's micro, not milli). And with that happening once every several seconds (say, at a 1GB/sec sustained allocation rate in a 100GB heap), your allocation-related 99.999%'lie may creep up to hundreds of usec as a result. And it's true that with no allocation whatsoever done in your process, those outlier artifacts *could* have been completely avoided for threads that sit on dedicated isolcpus cores... but normally, those artifacts are so far below the levels of other, much bigger outlier causes, that you won't be able to measure them.

The reality is that those situations (threads on dedicated isolcpus cores) are the ONLY case in which you would be able to measure the outlier effects of allocation when a good allocator and a good collector are involved. In all the situations I know of (e.g. not using isolcpus, or having other threads that allocate in the same process), the noise generated by OS-level artifacts (scheduling, interrupts, paging, and other on-demand funny stuff) presents far bigger outliers, and far more frequent ones, than those that are caused by allocation.

For a practical example of how happy a good allocator (and associated background collector) can make people, and how much time they can save in avoiding the "workaround bugaboo" here is a link to a posting that really made my day yesterday: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2014-October/002045.html . We get a lot of similar feedback from people running low latency code on our platform, but most of those don't get to post publicly about it...

-- Gil.



Todd Montgomery

unread,
Oct 28, 2014, 4:38:42 PM10/28/14
to mechanica...@googlegroups.com
There is an additional sentiment that I wish I understood. Seems that many people believe that parsing is easier with ASCII.

-- Todd

Martin Thompson

unread,
Oct 28, 2014, 4:50:54 PM10/28/14
to mechanica...@googlegroups.com
On 28 October 2014 19:52, Gil Tene <g...@azulsystems.com> wrote:
<< Putting on my contrarian "but allocations ARE cheap" hat. >>

I disagree, guys. These notions of "allocation causes outliers" is bogus. Pauses and glitches are not an inherent side-effect of allocation. They are side effects of bad allocator implementations.

For clarity, by "allocation" we really need to be talking about object lifecycle management. Allocation is such a small part of the domain.

Most GC implementations - including C4 I believe - are based on the weak generational hypothesis, i.e. most objects die young. Have you measured an application making significant use of persistent data structures, such as FP-style tries? They allocate a lot, and the objects typically live medium term. Our current-generation garbage collectors really suck at dealing with that type of lifecycle.

I think as the years go on this is going to become a really hot topic. Persistent data structures are really nice to use and reason about, but they have serious implementation issues that need to be considered. For me this is a fascinating topic that deserves serious research.

Martin...

Todd Montgomery

unread,
Oct 28, 2014, 4:54:59 PM10/28/14
to mechanica...@googlegroups.com
Our favorite neighborhood allocator is a tremendous revolutionary step up from other allocators. I can't say enough good about it.

However, a lot of the time the need for so much allocation is unfounded. For the case of Selector, it seems quite shortsighted to consider HashSet a prudent thing to _lock_ in and NOT allow it to be the caller's choice. This isn't just a performance issue; it would seem to be an API issue as well. Especially with lambdas being slapped in, I mean, so prevalent through the Java 8 API. But that is just an example. In many places in the JDK there are baffling choices made wrt object ownership, creation, etc.

I don't think it is really just the desire to have things work on other runtimes. It's somewhat of a code style, API, and usability thing as well.

-- Todd


Gil Tene

unread,
Oct 28, 2014, 6:14:22 PM10/28/14
to mechanica...@googlegroups.com
Just because we can leverage a cool trick to be 10x-20x more efficient at some workload doesn't mean that the other sort is "bad" ;-)

C4 does leverage the weak generational hypothesis for efficiency, but not for mutator speed or for lack of outliers. The algorithm uses the exact same concurrent Mark-Compact mechanism for newgen and oldgen, which makes it work great even for workloads that exhibit significant churn in long- and medium-lived objects (like caches, moving-time-window workloads, and many large in-memory analytic workloads do).

To be clear about what I mean by efficiency: I mean "amount of background CPU cycles spent per unit of allocation". This efficiency effect is pretty dramatic, as in usually 10x-20x fewer background GC cycles spent compared to single-generation runs. So if we were to spend, say, 0.5% of the system's available CPU cycles on GC in a given generational-friendly workload, that would grow to 5-10% if the workload was completely "adversarial" and the newgen filter was not getting us the efficiency it normally does. [And yes, these are typical numbers for latency sensitive apps, even when allocating multiple GB/sec.]

Note that this math relates to background work and cpu cycles spent outside of your latency path. This is work that other cores (on potentially other sockets) are doing, without slowing down your latency sensitive threads or causing them any extra outliers.

Luckily, there is a way to buy back even this lost efficiency with an algorithm like C4 (whose cost is purely linear to the live set): for every doubling of empty memory in the heap, C4's efficiency doubles, and this holds even for adversarial, non-generational-friendly workloads.

A different way to think about it is this: "all" the weak generational hypothesis buys us in Zing is the ability to keep up with the same workload (live set, allocation rate, promotion rate) in a smaller heap. But we can gain the same benefit by throwing cheap, empty memory at the problem. E.g. growing your empty heap from 5GB to 50-100GB will give the same 10x-20x efficiency boost. And these days, an extra 100GB means no more than an extra $1700.

Gil Tene

unread,
Oct 28, 2014, 6:20:54 PM10/28/14
to mechanica...@googlegroups.com
+1.

Slow, inefficient code is slow and inefficient... And allocation for no good reason is plain waste.

But there is a huge difference between optimizing code, which often includes removing *wasteful* allocation from hot paths, and going for a "thou shalt not allocate anything" approach in the hot path, or in the entire process. 

The "thou shalt not allocate" approaches always normally driven by outliers, not by speed or the need to optimize critical paths. Anything in the process that allocates (critical path or not) will cause an eventual GC to occur, and if the GC you use causes pain, your predictable reaction to that pain would be "thou shalt not allocate"...

But when that pain isn't there, allocation stops being a sin. It's still not something you should do for no good reason, and you should still be optimizing your *hot* paths and your latency-critical paths, but allocation is virtually always faster than pooling, for example, and when one of the two is needed on a latency-critical path, allocation usually results in a lower latency.

Todd Montgomery

unread,
Oct 28, 2014, 6:45:57 PM10/28/14
to mechanica...@googlegroups.com
I look at "thou shalt not allocate anything in the process" as unrealistic in a large system. Someone has to allocate if any state is kept at all. It's like "no side effects" as FP proponents point out. No state nor side effects? Well, that is a pretty nice, nearly useless box...

However, the "thou shalt not allocate anything in the data path" is just as much a cleanliness and efficiency point. I've done this even in C. A lot of times this is due to not wanting to leak state or leak memory or prevent leaky abstractions. But I think there is something to be said for making sure that state lifetime and object lifetime are tightly controlled and well thought out. For me, this usually leads to cleaner and simpler designs. It's so often overlooked.

One thing that concerns me with Java is that it is insanely easy to not care about object lifetime. Maybe this is good. But it does concern me, as it is not possible to be totally ignorant of memory usage - or else you eventually hold onto everything and OOM. But I do agree that we don't need to hold onto memory and hoard it as we would something like FDs, mmaps, etc. It's not THAT precious; we can make some very good, enlightened tradeoffs. A little more memory for more efficiency is a fundamentally good one.

It's too bad that often as a discipline we look at extremes way too much. All the interesting stuff is in the shades of grey zones....

-- Todd


Kirk Pepperdine

unread,
Oct 29, 2014, 3:27:00 AM10/29/14
to mechanica...@googlegroups.com
The internet running in debug mode is one of the most brilliant statements on the state of logging today that I've ever heard.. I want to quote you!!!
All too common.. in fact I'd say that most people want a number of things and it's all rolled up into one thing that doesn't work well for anything… If it's baked into a protocol then it's almost impossible to untangle.

Regards,
Kirk

Todd Montgomery

unread,
Oct 29, 2014, 11:29:24 AM10/29/14
to mechanica...@googlegroups.com
Unless you decide to throw out the protocol and start over again, à la HTTP/2.

-- Todd

Kirk Pepperdine

unread,
Oct 30, 2014, 3:28:54 AM10/30/14
to mechanica...@googlegroups.com
Martin,

I read one of your earlier posts about how allocations cost and I have to say it is completely consistent with my experiences also. My current threshold for starting to look at memory efficiency as a way of improving performance is ~300-400MB/sec. Yeah, that's it!!! If allocation rates exceed that, then I know I can get significant gains in performance by working on memory efficiency. You can't tune this out by turning GC knobs on the JVM. You need to find out who's allocating and kill it.

I hear what Gil is saying and, as usual, it's difficult to argue against his deep knowledge of how things work, but then… every time I've seen a team exhaust the gains they can get with an execution profiler, I'll look at allocation churn, and if it's higher than that threshold I know I'm going to look like a genius in that engagement (even though we all know the truth ;-))

Regards,
Kirk


Mamontov Ivan

unread,
Oct 30, 2014, 7:59:17 AM10/30/14
to mechanica...@googlegroups.com
What about true RISC processors, for example POWER8, with a faster processor, more cache, and more memory?

On Friday, 24 October 2014, 8:57:04 UTC+3, Gil Tene wrote:
[Some of this is cut/pasted from emails in which I had answered similar questions]

A few clarification points before you answer the question themselves:

You mentioned that throughput is not the main concern (<1K messages/sec expected), and that you want to minimize the latency between receiving an event and sending a response. But when you say "minimize the latency", do you mean:
a) you want to minimize the best case latency
b) you want to minimize the median latency
c) you want to minimize the worst experienced latency
d) some other level (e.g. minimize the 99.9%'lie)

The answers to some of your questions will vary depending on which of these you mean. And the right answer is probably to have goals for at least three of the points above (b, c, and d at least).

If all you care about is the best case and the median latency, you are probably best off with a single socket, highest clock rate, small core count setup with HT turned off. So what if your latency jumps to 10-20msec every once in a while? You'll have great best case and median numbers to show off, and you said you don't care about high percentiles.

However, if you care at all about your higher percentiles, you'll want to make sure to have enough cores to accommodate the peak number of runnable threads in your system, with some room to spare, and even then you may want to play some taskset, numactl, or isolcpus games. This is where higher core counts, dual socket setups, and turning HT on come into play: they all help keep the high percentiles and outliers down (in both frequency and magnitude).

As for NICs: my experience is that burning a core or two on OpenOnload or the like (spinning with no backoff) is worth a couple of usec in the common case, and that this is especially true for non-saturated workloads (like your <1K/sec), where each latency is measured from a standstill (responding to a single event). If what you care about is latency (and not power, space, or saving a few $), burning those cores is a cheap way to get what you want. Same goes for the disruptor. Burning cores on controlled & dedicated spinning buys you better latency across the board, as long as you have plenty of cores to burn (it gets really bad when you don't have those cores).

When it comes to comparing clock frequencies, I usually tell people to compare the max turbo boost frequencies, rather than the nominal ones. E.g. an E5-2697 V3 has 14 cores, 35MB of L3, and shows 2.6GHz nominal. An E5-2687W V3 has 10 cores, 25MB of L3, and shows 3.1Ghz nominal. And an E5-2667 V3 has 8 cores, 20MB of L3, and shows 3.2Ghz nominal. So you may think the 3.2GHz number is best. But if you look carefully, all three chips have the same Turbo Boost frequency (3.6Ghz), and chances are that with the same number of cores actually busy, they'll all be running at the same frequency. But the larger chip (E5-2697) has more cores, which can help with outliers and allow for more generous "burning" of cores. And it has a significantly bigger L3 which can help with everything.

When you are latency sensitive, your best hint as an initial filter is the power envelope: you should be looking at and choosing between chips in the higher power envelope rankings (135-160W usually). The lower power chips will keep your frequencies lower. [Also note that desktop chips tend to be classified by their turbo boost, as in "up to 4.4GHz", while server chips focus on nominal freq. (2.6GHz rather than "up to 3.6GHz"); that's probably marketing driven.]

For the common single socket vs. dual socket question, and Haswell-E vs. E5, I usually lean towards the E5s. Yes, something like the i7-4790K has a higher clock rate: 4.0GHz nominal, boosting to 4.4GHz, but it only has 4 physical cores, and with the disruptor and a good NIC stack you'll be burning those cores down fast, and may have 10-20msec outliers to deal with, quickly causing you to prematurely give up on spinning and start paying in common-case latency for this lack of spinning cores. Another E5 benefit is memory capacity and bandwidth. i7s tend to peak at 32GB and have only 2 memory channels, while E5s have 4 channels, support faster DRAMs (e.g. 2133 vs. 1600), and a lot more memory (a typical config of 128GB per socket is commonplace in 2 socket servers today; e.g. that's what the latest EC2 machines seem to be built with).

As for 1 vs. 2 sockets: most 1 vs. 2 socket situations where the 1 socket system "wins" can be matched with numactl and locking down the workload (including its RAM) to one of the two sockets. 2 socket setups have the benefit of allowing you to separate background stuff (like the OS and rarely active processes or threads, and potentially your own non-critical threads) from your critical workload. They do require more care, e.g. with things like IRQ assignment and making sure your packets are processed (in the network stack) in the same L3 that your consuming threads will be on, but in the end that extra capacity can be a lifesaver.

Bottom line:

When people ask me about a "good" system for low latency (and this happens a lot), I answer that if $$ or power are not the drivers, and latency is, my current recommendation is to go with a 2 socket E5 (with something like an E5-2697 V3 or similar), and 128GB-256GB of memory. The street price for that is somewhere between $8K and $12K ($3K-$4K of which is spent when focusing on the faster, higher core count Xeons). I also recommend investing in a good 10G NIC (like Solarflare), which will add some to that price, but buy you a few more usec.


Vitaly Davidovich

unread,
Oct 30, 2014, 9:54:13 AM10/30/14
to mechanica...@googlegroups.com

This is consistent with my experience as well, both in java and .NET (not necessarily the same allocation rate Kirk observes,  but the general notion).

There's a lot of work involved in stopping mutators, doing a GC, and restarting them (safepoint, stack walking, OS kernel calls, icache and dcache pollution, actual GC logic, etc.). Even if collections are done concurrently, there are resources being taken away from app threads. Sure, if you have a giant machine where the GC can be fully segregated from app threads, maybe that's ok. But even then, I'd rather use more of that machine for app logic rather than overhead, so keeping allocations to a minimum is beneficial.

In native languages, people tend to care about memory allocation more (some of that is natural to the fact that, well, it's an unmanaged environment, but it's also due to being conscious of its performance implications), whereas in java and .NET it's a free-for-all; I blame that on the "allocations are cheap" mantra.  Yes, an individual allocation is cheap, but not when you start pounding the system with them.  Native apps tend to use arenas/pools for long-lived objects (especially ones with specific lifetimes, e.g. servicing a request), and stack memory for temporaries.  Unfortunately,  as we all know, there's no sure way to use stack memory (unless you want to destructure objects into method arguments, which is ugly and brittle at best).

Quite a few of the big well known "big data"/perf sensitive java projects have hit GC problems, leading to solutions similar to what people do in native land.  I don't have experience with Azul's VM so maybe it really would solve most of these problems for majority cases.  But personally, I'm greedy - I want more of my code running than the VM's! :)

Sent from my phone

Gil Tene

unread,
Oct 30, 2014, 12:02:58 PM10/30/14
to mechanica...@googlegroups.com
These "keep allocation rates down to 640KB/sec" (oh, right, you said 300MB/sec) guidelines are are purely driven by GC pausing behavior. Nothing else.

Kirk (and others) are absolutely right to look for such limits when pausing GCs are used. But the *only* thing that makes allocation rates a challenge in today's Java/.NET (and other GC based) systems is GC pauses. All else (various resources spent or wasted) falls away with simple mechanical sympathy math. Moore's law is alive and well (for now), and hardware-related sustainable allocation rate follows it nicely. 20+GB/sec is a very practical level on current systems when pauses are not an issue. And yes, that's 50x the level at which people seem to "tune" by crippling their code or their engineers...

Here is some basic mechanical sympathy driven math about sustainable allocation rates (based mostly on Xeons):

1. From a speed and system-resources-spent perspective, sustainable allocation rate has roughly followed Moore's law over the past 5 Xeon CPU generations.

1.1 From a CPU speed perspective:

- The rate of sustainable allocation of a single core (at a given frequency) is growing very slowly over time (not @ Moore's law rates, but still creeping up with better speed at similar frequency, e.g. Haswell vs. Nehalem).

- The number of cores per socket is growing nicely, and with it the overall CPU power per socket (@ roughly Moore's law). (e.g. from 4 cores per socket in late 2009 to 18 cores per socket in late 2014).

- The overall CPU power available to sustain allocation rate per socket (and per 2 socket system, for example) is therefore growing at roughly Moore's law rates.

1.2 From a cache perspective:

- L1 and L2 cache size per core have been fixed for the past 6 years in the Xeon world.

- The L3 cache size per core is growing fairly slowly (not at Moore's law rates), but the L3 cache per socket has been growing slightly faster than the number of cores per socket. (e.g. from 8MB/4_core_socket in 2009 to 45MB/18_core_socket in late 2014).

- The cache size per socket has been growing steadily at Moore's law rates.

- With the cache space per core growing slightly over time, cache available for allocation work per core remains fixed or better.

1.3 From a memory bandwidth point of view:

- The memory bandwidth per socket has been steadily growing, but at a rate slower than Moore's law. E.g. A late 2014 E5-2690 V3 has a max bandwidth of 68GB/sec. per socket. A late 2009 E5590 had 32GB/sec of max memory bandwidth per socket. That's a 2x increase over a period of time during which CPU capacity grew by more than 4x.

- However, the memory bandwidth available (assume sustainable memory bandwidth is 1/3 or 1/2 of max), is still way up there, at 1.5GB-3GB/sec/core (that's out of a max of about 4-8GB/sec per core, depending on cores/socket chosen).

- So while there is a looming bandwidth cap that may hit us in the future (bandwidth is growing slower than CPU power), it's not until we reach allocation levels of ~1GB/sec/core that we'll start challenging memory bandwidth in current commodity server architectures.

- From a memory bandwidth point of view, this translates to >20GB/sec of comfortably sustainable allocation rate on current commodity systems.

1.4 From a GC *work* perspective:

- From a GC perspective, work per allocation unit is a constant that the user controls (with the ratio of empty to live memory).

- On Copying or Mark/Compact collectors, the work spent to collect a heap is linear to the live set size (NOT the heap size).

- The frequency at which a collector has to do this work roughly follows: allocation_rate / (heap_size - live_set_size). (A rough worked example with made-up numbers follows at the end of this list.)

- The overall work per time unit therefore follows the allocation rate (for a given live_set_size and heap_size).

- And the overall work per allocation unit is therefore a constant (for a given live_set_size and heap_size)

- The constant is under the user's control. E.g. user can arbitrarily grow heap size to decrease work per unit, and arbitrarily shrink memory to go the other way (e.g. if they want to spend CPU power to save memory).

- This math holds for all current newgen collectors, which tend to dominate the amount of work spent in GC (so not just in Zing, where it holds for both newgen and oldgen).

- But using this math does require a willingness to grow the heap size with Moore's law, which people have refused to do for over a decade. [driven by the unwillingness to deal with the pausing effects that would grow with it]

- [BTW, we find it to be common practice, on current applications and on current systems, to deal with 1-5GB/sec of allocation rate, and to comfortably do so while spending no more than 2-5% of overall system CPU cycles on GC work. This level seems to be the point where most people stop caring enough to spend more memory on reducing CPU consumption.]
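
To put rough numbers on the frequency formula above (all values here are made-up assumptions for illustration, not measurements from the post):

public class GcCycleMath {
    public static void main(String[] args) {
        double allocRateGBps = 2.0;   // sustained allocation rate (assumed)
        double heapGB = 64.0;         // heap size (assumed)
        double liveGB = 16.0;         // live set (assumed)

        // frequency ~= allocation_rate / (heap_size - live_set_size)
        double secondsBetweenCycles = (heapGB - liveGB) / allocRateGBps;
        System.out.printf("GC cycle roughly every %.0f sec%n", secondsBetweenCycles);   // ~24 sec

        // Doubling the empty memory (48GB -> 96GB empty, i.e. a 112GB heap) doubles
        // the interval between cycles, halving GC work per unit of allocation.
        System.out.printf("with a 112GB heap: roughly every %.0f sec%n",
                (112.0 - liveGB) / allocRateGBps);   // ~48 sec
    }
}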

2. From a GC pause perspective:

- This is the big bugaboo. The one that keeps people from applying all the nice math above. The one that keeps Java heaps and allocation rates today at the same levels they were 10 years ago. The one that seems to keep people doing "creative things" in order to keep leveraging Moore's law and having programs that are aware of more than 640MB of state.

- GC pauses don't have to grow with Moore's law. They don't even have to exist. But as long as they do, and as long as their magnitude grows with attempts to linearly grow state and allocation rates, those pauses will keep capping both. And by magnitude, we're not talking about averages. We're talking about the worst thing people will accept during a day.

- GC pauses seem to be limiting both allocation rates and live set sizes.

- The live set size part is semi-obvious: If your [eventual, inevitable, large] GC pause grows with the size of your live set or heap size, you'll cap your heap size at whatever size causes the largest pause you are willing to bear. Period.

- The allocation rate part requires some more math, and this differs for different collector parts:

2.2 For newgen collectors: 

- By definition, a higher allocation rate requires a linearly larger newgen size to maintain the same "objects die in newgen" properties. [e.g. if you put 4x as many cores to work doing the same transactions, with the same object lifetime profiles, you need 4x as much newgen to avoid promoting more things and incurring larger pauses.]

- While "typical" newgen pauses may stay just as small, a larger newgen linearly grows the worst-case amount of stuff that a newgen *might* promote in a single GC pauses, and with it grows the actual newgen pause experienced when promotion spikes occur. 

- Unfortunately, real applications have those spikes every time you read in a bunch of long-lasting data in one shot (like updating a cache or a directory, or reading in a new index, or replicating state on a failover).

- Latency sensitive apps tend to cap their newgen size to cap their newgen pause times, in turn capping their allocation rate.

2.3 For oldgen collectors:

- Oldgen collectors that pause for *everything* (like ParallelGC) actually don't get worse with allocation rate. They are just so terrible to begin with (pausing for ~1 second per live GB) that outside of batch processing, nobody would consider using them for live sets larger than a couple of GB (unless they find regularly pausing for more than a couple of seconds acceptable).

- Most oldgen collectors that *try* not to pause "most" of the time (like CMS) are highly susceptible to allocation rate and mutation rate (and mutation rate tends to track allocation rate linearly in most apps). E.g. the mostly-concurrent-marking algorithms used in CMS and G1 must revisit (CMS) or otherwise process (G1's SATB) all references mutated in the heap before they finish. The rate of mutation increases the GC cycle time, while at the same time the rate of allocation reduces the time the GC has in order to complete its work. At a high enough allocation rate + mutation rate, the collector can't finish its work fast enough, and a promotion failure or a concurrent mode failure occurs. And when that occurs, you get that terrible pause you were trying to avoid.

- As a result, even for apps that don't try to maintain "low latency" and only go for "keep the humans happy" levels, most current mostly-concurrent collectors only remain mostly-concurrent within a limited allocation rate. Which is why I suspect these 640KB/sec (oh, right, 300MB/sec) guidelines exist.


Bottom line:

With less pauses comes less responsibility.

[ I need to go do real work now... ]

Ariel Weisberg

unread,
Oct 30, 2014, 12:24:35 PM10/30/14
to mechanica...@googlegroups.com
Hi,

My operating theory for the past few years has been that as long as all allocations die young, you can allocate at an almost arbitrary rate by increasing the size of the young generation to reduce GC frequency to an acceptable level. I'm not proposing that this is free from a cache efficiency perspective, but I don't have a mental cost model for exactly how expensive it is.

This is based on the belief that a copying young gen collector only pays a cost for the number of live objects.

Keep in mind I am not approaching this from the HFT mindset, but rather the commodity database mindset where acceptable 99.999% time scales are milliseconds or tens of milliseconds and I see young gen GCs once or twice a second as acceptable.

Is this an accurate way to think about the cost of allocation rate? Does anyone have some concrete metrics on the impact of allocation rate on cache performance assuming GCs are still infrequent?

Ariel

Gil Tene

unread,
Oct 30, 2014, 1:10:19 PM10/30/14
to mechanica...@googlegroups.com
Ariel, putting GC pauses and cost aside completely (in order to answer your question below), here is how I build the mental model on the allocation side:

First, let's put aside the notion that rates have anything to do with it: faster or slower is not the point. Allocation rate (measured in MB/sec) has no effect on cache behavior. It's the ratio of allocation (measured in bytes) to work done (measured in something like instructions/op) that we are really talking about, I think.

Now let's assume a certain workload with a certain work done per operation. That work includes some amount of allocation (>=0). The costs associated with that allocation can be generally sliced into the following high level parts:

1. Cycles spent on math and pointer chasing involved in figuring out where the allocation should be made.

2. Cycles spent bringing the allocated memory into the cache (if it's not already there).

3. Cycles spent on initializing the allocation area (when it's in the cache, so not counting cost to bring into cache).

4. Side effects (on other parts of your operation) of evicting other useful stuff from the cache to make room for the stuff the allocation brings in.

Depending on how your allocation mechanism works, different parts of the above will have different costs. You want to build your mental model according to the actual allocation mechanism you use:

E.g. non-thread-local allocators will tend to have a fairly high cost for #1 above in multi-threaded systems. (I count things that do thread-local caching of blocks or lists as "thread local" even when they periodically go to get material from shared pools in an easy to amortize way). A CAS (or atomic SWAP) is a very high expense item when it comes to allocation and pipeline behavior.

E.g. a thread-local stack based allocator, or any thread-local effectively-LIFO pool allocator will tend to have a low cost for #2 and #4 above.

E.g. A bump-pointer allocator (used in TLABs in most GC systems, and in arena allocators) has the [theoretically] lowest possible cost for #1, and can usually completely hide the cost of #2 in perfect pipelines.
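
For anyone who hasn't looked inside a TLAB, a minimal sketch of the bump-pointer idea (a toy illustration, not HotSpot's actual code): the common case is a bounds check and a pointer bump, with no CAS because the block is thread-local.

final class BumpAllocator {
    private final byte[] block;   // stands in for a TLAB carved out of the heap
    private int top;              // next free offset

    BumpAllocator(int sizeBytes) {
        this.block = new byte[sizeBytes];
    }

    // Returns the offset of the new "object", or -1 when the block is exhausted
    // (a real allocator would then fetch a fresh TLAB via a slower shared path).
    int allocate(int sizeBytes) {
        int newTop = top + sizeBytes;
        if (newTop > block.length) {
            return -1;            // slow path: refill
        }
        int offset = top;
        top = newTop;             // the "bump"
        return offset;
    }
}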

E.g. Virtually all allocators that cannot reuse stale state in the allocated area have the same exact cost for #3. [i.e. anything that zeroes all the contents at some point in either allocation or use, which is true for most allocators or allocations.]

#4 above is the hardest one to make general statements about. The side effects of bringing in cache lines (when you ignore the costs of #1, #2, #3 above) are very application and machine specific.

E.g. If you have a hot spinning workload that frequently touches all its hot lines, and an LRU (or LRU-like) cache, streaming lines in and out for allocation will not have a significant effect on your workload.

E.g. If your thread strongly benefits from caching large amounts (megabytes) of state that is infrequently accessed, its own allocation pattern may cause eviction and related costs in subsequent misses.

E.g. If you have a critical but infrequently triggered latency path that cannot actively keep all its cache state hot by touching every part of it all the time, it will be susceptible to neighboring workloads (on the same socket, core, etc.) evicting its state from the cache, and will suffer longer latencies on each operation due to increased miss rates. [BTW, that's why I tell "truly crazy" people that will do whatever it takes to keep latencies low to build more complicated spin loops that actually touch all their critical data in the loop.]
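
A hedged sketch of what such a "touch your critical data while you spin" loop might look like (names and structure are my own illustration, not anyone's production code):

final class WarmSpinLoop {
    private final long[][] criticalTables;   // hot state the latency path will need
    private volatile boolean eventReady;
    private volatile long sink;              // keeps the touching loop from being optimized away

    WarmSpinLoop(long[][] criticalTables) {
        this.criticalTables = criticalTables;
    }

    void spinUntilEvent() {
        long checksum = 0;
        while (!eventReady) {
            // Instead of a bare busy-wait, walk the critical structures so their
            // cache lines stay resident rather than being evicted by neighbors.
            for (long[] table : criticalTables) {
                for (int i = 0; i < table.length; i += 8) {   // roughly one touch per 64-byte line
                    checksum += table[i];
                }
            }
            sink = checksum;
        }
        // ... handle the event with (hopefully) warm caches ...
    }

    void signal() {
        eventReady = true;
    }
}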

You can go on and on from here. But these are the basic building blocks I use to reason about allocation costs.

Rüdiger Möller

unread,
Oct 30, 2014, 4:50:40 PM10/30/14
to mechanica...@googlegroups.com
Wow, this thread is packed with useful information. So sorry for offtopic'ing again .. 

On Tuesday, 28 October 2014 21:38:42 UTC+1, toddleemontgomery wrote:
There is an additional sentiment that I wish I understood. Seems that many people believe that parsing is easier with ASCII.


True. Of course it isn't (charset issues, upper/lower case, 0xD 0xA, ...).

On Wednesday, 29 October 2014 08:27:00 UTC+1, Kirk Pepperdine wrote:
The internet running in debug mode is one of the most brilliant statements on the state of logging today that I’ve ever heard.. I want to quote you!!!

[author blushes]

- ruediger

 

Kirk Pepperdine

unread,
Oct 31, 2014, 3:07:31 AM10/31/14
to mechanica...@googlegroups.com
On Oct 30, 2014, at 2:54 PM, Vitaly Davidovich <vit...@gmail.com> wrote:

This is consistent with my experience as well, both in java and .NET (not necessarily the same allocation rate Kirk observes,  but the general notion).


I'm curious to know: what is the threshold on allocation rate that makes you want to look at allocations?

There's a lot of work involved in stopping mutators, doing a GC, and restarting them (safepoint, stack walking, OS kernel calls, icache and dcache pollution, actual GC logic, etc).  Even if collections are done concurrently,  there're resources being taken away from app threads.  Sure, if you have a giant machine where the GC can be fully segregated from app threads, maybe that's ok.  But even then, I'd rather use more of that machine for app logic rather than overhead, so keeping allocations to a minimal is beneficial.


Humm, I see huge benefits in lowering allocation rates even without GC being a problem. IOW, GC throughput can even be 99%, and if the allocation rate is 1GB/sec (for example), working to push it down always brings a significant gain.

In native languages, people tend to care about memory allocation more (some of that is natural to the fact that, well, it's an unmanaged environment, but it's also due to being conscious of its performance implications), whereas in java and .NET it's a free-for-all; I blame that on the "allocations are cheap" mantra.  Yes, an individual allocation is cheap, but not when you start pounding the system with them.  Native apps tend to use arenas/pools for long-lived objects (especially ones with specific lifetimes, e.g. servicing a request), and stack memory for temporaries.  Unfortunately,  as we all know, there's no sure way to use stack memory (unless you want to destructure objects into method arguments, which is ugly and brittle at best).

Quite a few of the big well known "big data"/perf sensitive java projects have hit GC problems, leading to solutions similar to what people do in native land.  I don't have experience with Azul's VM so maybe it really would solve most of these problems for majority cases.  But personally, I'm greedy - I want more of my code running than the VM's! :)


This rush to off-heap memory IMHO isn't justified. I've recently been looking at Neo4J, Cassandra, Hazelcast and a whole host of other like technologies. Instead of solving the real problems they have in their implementations, they've thrown caution to the wind and gone off heap. The real problem they all have is that they are written in a very memory-inefficient way. I bet that if they were to improve the memory efficiency, the motivation to go off-heap would be weak at best. Let's face it.. going off-heap is *cool* and has nothing to do with real requirements.

Regards,
Kirk


Kirk Pepperdine

unread,
Oct 31, 2014, 3:13:35 AM10/31/14
to mechanica...@googlegroups.com

On Oct 30, 2014, at 5:24 PM, Ariel Weisberg <arielw...@gmail.com> wrote:

> Hi,
>
> My operating theory for the past few years has been that as long as all allocations die young you can allocate at an almost arbitrary rate by increasing the size of the young generation to reduce GC frequency to an acceptable level. I'm an not proposing that this is free from a cache efficiency perspective, but I don't have a mental cost model for exactly how expensive it is.

Not my experience. I have tuned a number of applications with this behavior and the results have been consistently the same. Furthermore, a majority of these tuning engagements have occurred *after* the team has exhausted all (reasonable) gains from execution-profiling their application. My gains from reducing allocations are often much larger than the gains they've been making with execution profiling.

>
> Is this an accurate way to think about the cost of allocation rate? Does anyone have some concrete metrics on the impact of allocation rate on cache performance assuming GCs are still infrequent?

Indeed it would be nice to know so that I’d have a better explanation, much better than hand-waving ;-)

Kirk


Martin Thompson

unread,
Oct 31, 2014, 3:14:12 AM10/31/14
to mechanica...@googlegroups.com
On 31 October 2014 07:07, Kirk Pepperdine <ki...@kodewerk.com> wrote:

This rush to off-heap memory IMHO isn't justified. I've recently been looking at Neo4J, Cassandra, Hazelcast and a whole host of other like technologies. Instead of solving the real problems they have in their implementations, they've thrown caution to the wind and gone off heap. The real problem they all have is that they are written in a very memory-inefficient way. I bet that if they were to improve the memory efficiency, the motivation to go off-heap would be weak at best. Let's face it.. going off-heap is *cool* and has nothing to do with real requirements.

Hmmm. I get to see many very real requirements to go off the Java heap. Inter Process Communication being just one. 

The products you list are all data stores. How would you operate a 100GB+ heap in Java without using Zing and cope with the GC pauses or enable memory prefetching and page locality? The only way I can see to do this is to stuff the data into primitive arrays but then that is just as awkward as going off heap. 

Kirk Pepperdine

unread,
Oct 31, 2014, 3:36:26 AM10/31/14
to mechanica...@googlegroups.com

This rush to off-heap memory IMHO isn't justified. I've recently been looking at Neo4J, Cassandra, Hazelcast and a whole host of other like technologies. Instead of solving the real problems they have in their implementations, they've thrown caution to the wind and gone off heap. The real problem they all have is that they are written in a very memory-inefficient way. I bet that if they were to improve the memory efficiency, the motivation to go off-heap would be weak at best. Let's face it.. going off-heap is *cool* and has nothing to do with real requirements.

Hmmm. I get to see many very real requirements to go off the Java heap. Inter Process Communication being just one. 

The products you list are all data stores. How would you operate a 100GB+ heap in Java without using Zing and cope with the GC pauses or enable memory prefetching and page locality? The only way I can see to do this is to stuff the data into primitive arrays but then that is just as awkward as going off heap. 

I'm not against going off-heap. I'm just saying that in the vast majority of the cases I run into, there isn't a real need to go off-heap. I'm working with 2 clients that have "need to be faster than our competitors" like requirements. So far we've not seen the need to go off-heap to satisfy this requirement. Not going off-heap means they've been able to remain simple, and remaining simple has had immeasurable benefits. That said, my guess is that these requirements are nowhere near as stringent as the ones you often face so...

Regards,
Kirk


Martin Thompson

unread,
Oct 31, 2014, 4:01:43 AM10/31/14
to mechanica...@googlegroups.com
I totally agree that striving for simplicity should be your top-ranked requirement. Simple is so often the fast option for performance anyway, so it's a double win.

There are a few unfortunate elephants in the room when it comes to large memory applications. For me these are having control of memory layout, and GC. Gil and I have proposed ObjectLayout to cope with the former. In my experience, and I'm an old C/C++ programmer so that skews my view, going off heap is no more tricky than understanding and tuning CMS or G1 at scale. Plus if I go off heap for large datasets to avoid GC, I can also control layout as an extra benefit. So often the simple option is going off heap rather than battling CMS and its quirks, which I'm sure you know very well :-) A good garbage collector should not require much configuration.

Give me C4 as a standard collector plus ObjectLayout and, other than IPC, I see little need to go off heap. Unfortunately I don't have these in my everyday world.

Jean-Philippe BEMPEL

unread,
Oct 31, 2014, 4:11:56 AM10/31/14
to mechanica...@googlegroups.com
+1 with Kirk,

Measure, Don't premature !

Richard Warburton

unread,
Oct 31, 2014, 5:02:25 AM10/31/14
to mechanica...@googlegroups.com
Hi,

Interesting conversation gents.

The products you list are all data stores. How would you operate a 100GB+ heap in Java without using Zing and cope with the GC pauses or enable memory prefetching and page locality? The only way I can see to do this is to stuff the data into primitive arrays but then that is just as awkward as going off heap.

I think the thing that's missing from this conversation is an explanation of the tradeoffs and context in which these decisions are made.

Things like databases and IPC generally have quite simple allocation patterns. With IPC it's often just allocating a slab of memory when you create the IPC channel and freeing it when you close the IPC channel. In situations like this, manual memory management isn't very hard. It's also a case where it's easy to get locality-of-reference benefits, because you're putting a large block of things which are likely to be accessed close in time, close together in space. Databases often have similar properties: big blocks of very similar, slab-allocatable objects with obvious locality-of-reference wins.
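
For instance, a bare-bones Java sketch of the "one slab per IPC channel" pattern, using a memory-mapped file as the shared slab (an illustration only; a real channel would layer a ring-buffer protocol with proper memory ordering on top):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

final class IpcSlab implements AutoCloseable {
    private final RandomAccessFile file;
    private final MappedByteBuffer slab;

    // One slab, allocated when the channel is created...
    IpcSlab(String path, int sizeBytes) throws Exception {
        this.file = new RandomAccessFile(path, "rw");
        this.file.setLength(sizeBytes);
        this.slab = file.getChannel().map(FileChannel.MapMode.READ_WRITE, 0, sizeBytes);
    }

    MappedByteBuffer buffer() {
        return slab;   // both processes map the same file and index into it
    }

    // ...and released when the channel is closed. No per-message allocation.
    @Override
    public void close() throws Exception {
        file.close();
    }
}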

Now let's suppose you're writing a compiler or static analysis tool in Java. This is an application which has a lot of near-random memory accesses which are hard to make sequential - walking a lot of trees and graphs. It has a lot of branches which are hard to predict. It's a problem domain where it is fundamentally harder to apply mechanically sympathetic approaches. Not saying you can't do anything or shouldn't try - but it's a less easy win. It's also an application domain which allocates lots of small objects. It's not obvious how going off-heap helps.

I've deliberately picked two things at opposite ends of the spectrum here to demonstrate the point, but the reality is never "off heap is a segfault & complexity inducing nightmare" vs "off heap is a great speedup technique". The reality is that if you have simple memory allocation and harvesting patterns and obvious locality-of-reference wins, then it could be a pragmatic choice for your application. If not - then it's all a moot discussion topic.

regards,

  Richard Warburton

Kirk Pepperdine

unread,
Oct 31, 2014, 6:03:11 AM10/31/14
to mechanica...@googlegroups.com
Well put!!!


Kirk Pepperdine

unread,
Oct 31, 2014, 7:17:35 AM10/31/14
to mechanica...@googlegroups.com
Hi Martin,

>
> There are a few unfortunate elephants in the room when it comes to large memory applications. For me these are having control of memory layout and GC. Gil and I have proposed ObjectLayout to cope with the former.

I had a lot of fun sitting in Gil’s kitchen looking over the code. So having seen and now used ObjectLayout I think this is a great proposal and it’s one that’s dearly needed given the problems..

> In my experience, and I'm an old C/C++ programmer so that skews my view, going off heap is no more tricky than understanding and turning CMS or G1 at scale.

Yes, but we’re not all old C/C++ programmers and we don’t really have good JVM support for this type of stuff.

> Plus if I go off heap for large datasets to avoid GC, I can also control layout as an extra benefit. So simple often is going off heap rather than battling CMS and its quirks which I'm sure you know very well :-) A good garbage collector should not require much configuration.

No, a good collector shouldn't need that much configuration or force you to resort to tactics to avoid it. So in the absence of a collector that doesn't require configuration, and with no way to avoid pointer chasing…

>
> Give me C4 as a standard collector plus ObjectLayout and, other than IPC, I see little need to go off heap. Unfortunately I don't have these in my everyday world.

SOT, I need to write about GemStone/S (Smalltalk) and GemStone/J (Java) but unfortunately I don't have time at the moment for a longer description. In Smalltalk the entire DB was "in-memory" even though most of it was "off-heap". GemStone tried to do the same thing for Java but unfortunately that didn't turn out as well. Some of the problems were due to the complexity of typing and certain keywords such as transient, volatile (and even static). However, it was able to keep data off-heap in a completely orthogonal manner while maintaining singularity and identity without the need for foreign key relationships.

In there was a class called a ClusterBucket. Its purpose in life was to keep data that needed to be close, close. It solved the pointer chasing problem at the page level. I see the essence of that idea in ObjectLayout, in that you solve the pointer chasing problem orthogonally to the application. However, GemStone/J had a number of difficulties it was never able to adequately deal with, all having to do with copy semantics and coherency. GemStone/S solved these problems by just making the persistent space part of the normal space: no copies, no copy problems. This wasn't an option in Java, and the split between Java heap and persistent heap almost worked, but not quite. There were always some small problems that were very hard to solve. I think I see some of the "copy" problems in ObjectLayout, but it takes time for me to get through Gil's thinking on stuff so I wouldn't be so bold as to say they are there but…

— Kirk


Vitaly Davidovich

unread,
Oct 31, 2014, 9:29:53 AM10/31/14
to mechanica...@googlegroups.com

I've had allocation rates all over the spectrum, anywhere from 200MB to 1GB per second, and this ranges from Java server daemons to .NET GUI apps. The rate which bogs down the app is, of course, dependent on machine class and application workload.

By the way, I'm not a proponent of "zero allocation at runtime at all" - I don't mind allocating where it makes sense, but that's almost always on slow/exceptional paths.  Hot/fast paths, however, should not allocate.  I'm not a masochist that likes to tune GC knobs - my "tuning" involves being a cheapskate with memory :).

All of this depends on the performance requirements of the system, obviously. If pausing for several tens/hundreds of millis and above every few seconds is ok, then sure, don't contort the code. If, however, we're talking about consistent performance in the sub-millisecond range, then every little waste adds up and you get death by a thousand cuts. I'm sure we've all seen profiles (cpu + mem) where there's no big elephant in the room but rather a bunch of small leeches.

By the way, off-heap isn't the only solution I was referring to when I mentioned other projects hitting gc problems.  Stackoverflow, for example, simply started using structs more (this is .net).  Roslyn team (c# compiler written in c#) started reusing certain objects/pooling and avoiding subtle boxing in hot paths, etc.  Cassandra went off-heap, yes, but they determined that their object lifetime just didn't jive well with how collectors want things to be in order to stay efficient.

My point isn't to be draconian here and not allocate at all (we're not talking about a microcontroller for a space rover here) but for people to stop using the "allocation is cheap" mantra in an effort to dismiss being wasteful. I'm convinced that this statement has caused more harm than good. People build libraries and allocate needlessly here and there, thinking "no big deal, these are short-lived, die young, don't get copied, don't contribute to young GC time, and I only do a handful of these, blah blah blah". Then you put tens of these libraries together to form some app, and bam, each one of those "isolated mindset designs", which wouldn't be terrible if they were the only thing running, now brings the system to its knees or leaves perf on the table for no good reason.

Gil's mechanical sympathy math is a nice analysis, but it seems to paper over the cache implications a bit. Sure, cores can sustain memory throughput of GB/s, but it's not like we're not using those channels for other things - I want to use as much of the resources as possible towards my app logic, and not towards runtime infra/overhead. Object reuse/pooling doesn't have to be slower than allocation; I'm not sure why that's being made to sound like a fact. If we're talking about threadsafe pools that service multiple threads, sure, but that's not the only type of pool/use case; you can reuse objects for a single thread of execution.
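
A small sketch of the single-threaded reuse being described here (illustrative names and fields of my own, not from any real codebase): one mutable scratch object per thread, reset per message, never escaping the thread or the call.

final class PriceEvent {
    long instrumentId;
    long priceTicks;
    long timestampNanos;

    void reset() {
        instrumentId = 0;
        priceTicks = 0;
        timestampNanos = 0;
    }
}

final class SingleThreadedHandler {
    private final PriceEvent scratch = new PriceEvent();   // reused; confined to this thread

    void onMessage(long id, long priceTicks, long timestampNanos) {
        scratch.reset();
        scratch.instrumentId = id;
        scratch.priceTicks = priceTicks;
        scratch.timestampNanos = timestampNanos;
        process(scratch);   // callee must not hold onto the reference past this call
    }

    private void process(PriceEvent event) {
        // business logic here
    }
}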

Let's also not forget that if your allocation rate is dependent on load, then you may tune things for today's load and reach satisfactory latency and/or throughput, but if tomorrow the load changes, you're back to the same exercise. If, however, the memory footprint is constant, then you don't need to worry about that (you may need to tweak other things in the system for higher load, but that's beside the point and always the case anyway).

I also wouldn't discount the effects of initiating a GC on your application performance.  Every time it's triggered, all the execution resources are shifted to servicing that.  As I said before, the data and instruction caches will be polluted; branch target buffers will be polluted; u-op or trace caches polluted; you get calls into the kernel involving the scheduler; etc.  Sure, we're talking about tens/hundreds of millis here, but in those tens/hundreds of millis I could've had tens/hundreds of millions of my own instructions executed, per core! Let me reiterate: if you care about consistent sub-millisecond performance, all these things matter and add up.  If your performance requirements are more lax, then maybe you don't.  Luckily, as Gil pointed out, modern machines are flat-out beasts, so they can dampen these types of things.

Sent from my phone

Benedict Elliott Smith

unread,
Oct 31, 2014, 10:48:19 AM10/31/14
to mechanica...@googlegroups.com
Kirk,

A quick response to your statement about Cassandra going off-heap, since I've done some of the recent work on that. There are two reasons for doing this, and neither are because it's cool.

1) Cassandra is intended for realtime scenarios where lengthy GC pauses are problematic - not everyone can run Zing, and a reduction in allocation without any change on any unrelated workload characteristics necessarily yields fewer GCs;
2) A majority of Cassandra's memory consumption is for many easily grouped, easily managed, relatively long-lived allocations. In this scenario you are wasting a persistent GC CPU burden walking object graphs you know aren't going to be collected. On top of which you're allocating/copying them multiple times, through each stage of the object lifecycle. This further disrupts the behaviour over the short lived objects, causing them to be promoted more often, resulting in a disproportionate proliferation of those lengthy full-GCs.

That's not to say there aren't many other improvements to be made besides, nor that we haven't made many improvements for the on-heap memory characteristics, but I suspect you are responding only to the marketing material, not to the actual development work going on. I encourage you to get involved if you feel like you can make a contribution.

Todd Lipcon

unread,
Oct 31, 2014, 11:29:40 AM10/31/14
to mechanica...@googlegroups.com


On Oct 31, 2014 7:48 AM, "Benedict Elliott Smith" <bellio...@datastax.com> wrote:
>
> Kirk,
>
> A quick response to your statement about Cassandra going off-heap, since I've done some of the recent work on that. There are two reasons for doing this, and neither are because it's cool.
>
> 1) Cassandra is intended for realtime scenarios where lengthy GC pauses are problematic - not everyone can run Zing, and a reduction in allocation without any change on any unrelated workload characteristics necessarily yields fewer GCs;
> 2) A majority of Cassandra's memory consumption is for many easily grouped, easily managed, relatively long-lived allocations. In this scenario you are wasting a persistent GC CPU burden walking object graphs you know aren't going to be collected. On top of which you're allocating/copying them multiple times, through each stage of the object lifecycle. This further disrupts the behaviour over the short lived objects, causing them to be promoted more often, resulting in a disproportionate proliferation of those lengthy full-GCs.
>
> That's not to say there aren't many other improvements to be made besides, nor that we haven't made many improvements for the on-heap memory characteristics, but I suspect you are responding only to the marketing material, not to the actual development work going on. I encourage you to get involved if you feel like you can make a contribution.

+1 -- same goes for HBase. In fact Intel is driving a lot of the work on HBase because without it we can't leverage high ram machines.

Todd

>
> On Fri, Oct 31, 2014 at 8:07 AM, Kirk Pepperdine <ki...@kodewerk.com> wrote:
>>
>>
>> On Oct 30, 2014, at 2:54 PM, Vitaly Davidovich <vit...@gmail.com> wrote:
>>
>>> This is consistent with my experience as well, both in java and .NET (not necessarily the same allocation rate Kirk observes,  but the general notion).
>>>
>>>
>> I’m curious to know what is the threshold on rate that make you want to look at allocations?
>>>
>>> There's a lot of work involved in stopping mutators, doing a GC, and restarting them (safepoint, stack walking, OS kernel calls, icache and dcache pollution, actual GC logic, etc).  Even if collections are done concurrently,  there're resources being taken away from app threads.  Sure, if you have a giant machine where the GC can be fully segregated from app threads, maybe that's ok.  But even then, I'd rather use more of that machine for app logic rather than overhead, so keeping allocations to a minimal is beneficial.
>>>
>>>
>> Humm, I see huge benefits of lowering allocation rates even without GC being a problem. IOWs GC throughput can even be 99% and if the allocation rates are 1G (for example), working to push them down always brings a significant gain.
>>
>>> In native languages, people tend to care about memory allocation more (some of that is natural to the fact that, well, it's an unmanaged environment, but it's also due to being conscious of its performance implications), whereas in java and .NET it's a free-for-all; I blame that on the "allocations are cheap" mantra.  Yes, an individual allocation is cheap, but not when you start pounding the system with them.  Native apps tend to use arenas/pools for long-lived objects (especially ones with specific lifetimes, e.g. servicing a request), and stack memory for temporaries.  Unfortunately,  as we all know, there's no sure way to use stack memory (unless you want to destructure objects into method arguments, which is ugly and brittle at best).
>>>
>>> Quite a few of the big well known "big data"/perf sensitive java projects have hit GC problems, leading to solutions similar to what people do in native land.  I don't have experience with Azul's VM so maybe it really would solve most of these problems for majority cases.  But personally, I'm greedy - I want more of my code running than the VM's! :)
>>
>>
>> This rush to off-heap memory IMHO isn’t justified. I’ve recently been looking at Neo4J, Cassandra, Hazelcast and a whole host of other such technologies. Instead of solving the real problems they have in their implementations, they’ve thrown caution to the wind and gone off-heap. The real problem they all have is that they are written in a very memory-inefficient way. I bet that if they were to improve the memory efficiency, the motivation to go off-heap would be weak at best. Let’s face it… going off-heap is *cool* and has nothing to do with real requirements.
>>
>> Regards,
>> Kirk
>>
>

Todd Montgomery

unread,
Oct 31, 2014, 12:30:01 PM10/31/14
to mechanica...@googlegroups.com
On Fri, Oct 31, 2014 at 6:29 AM, Vitaly Davidovich <vit...@gmail.com> wrote:

By the way, I'm not a proponent of "zero allocation at runtime at all" - I don't mind allocating where it makes sense, but that's almost always on slow/exceptional paths.  Hot/fast paths, however, should not allocate.  I'm not a masochist that likes to tune GC knobs - my "tuning" involves being a cheapskate with memory :).

 +15! Yes, this!

My point isn't to be draconian here and not allocate at all (we're not talking about a microcontroller for a space rover here) but for people to stop using the "allocation is cheap" mantra in an effort to dismiss being wasteful.  I'm convinced that this statement has caused more harm than good.  People build libraries and allocate needlessly here and there, thinking "no big deal, these are short-lived, die young, don't get copied, don't contribute to young GC time, and I only do a handful of these, blah blah blah".  Then you put tens of these libraries together to form some app, and bam, each one of those "isolated mindset designs", which aren't terrible if they were the only thing running, now bring the system to its knees or leave perf on the table for no good reason.

+100! It's also sad that the JDK fosters a wasteful mindset...

When I mentioned cleanliness, this is what I mean. All too often within Java frameworks and libraries, we see a sloppy consideration of object lifetime because "GC will take care of it".

And while Gil is correct in that a good collector will take care of it, especially if it is newgen, it also shows a clear lack of mechanical sympathy in design to think that way. Know when the collector will make a hard task trivial and when it will be a problem. I love that the collector will keep objects that I'm using around. It does free me from having to handle some nasty edge conditions and "I gots ta free dis before me be free'in dat"... but it's not a free pass to be sloppy in design.

A collector, even a fantastic one, is no excuse for not thinking about what the implications of your actions are.
 
-- Todd


Rajiv Kurian

unread,
Oct 31, 2014, 1:27:31 PM10/31/14
to mechanica...@googlegroups.com


On Friday, October 31, 2014 6:29:53 AM UTC-7, Vitaly Davidovich wrote:

I've had allocation rates all over the spectrum, anywhere from 1GB to 200MB per second, and this is ranging from java server daemons to .NET gui apps.  The rate which bogs down the app is, of course, dependent on machine class and application workload.

By the way, I'm not a proponent of "zero allocation at runtime at all" - I don't mind allocating where it makes sense, but that's almost always on slow/exceptional paths.  Hot/fast paths, however, should not allocate.  I'm not a masochist that likes to tune GC knobs - my "tuning" involves being a cheapskate with memory :).

All of this depends on the performance requirements of the system, obviously.  If pausing for several tens/hundreds of millis and above every few seconds is ok, then sure, don't contort the code.  If, however, we're talking about consistent performance in sub-millisecond range, then every little waste adds up and you get death by a thousand cuts.  I'm sure we've all seen profiles (cpu + mem) where there's no big elephant in the room but rather a bunch of small leaches.

Very well put. Coming from the C/C++ world, when I started working in Java the profligacy of developers was shocking. There were allocations in paths where I would never expect them - like calculating hash codes or doing object equality. All of this is considered okay because they are small allocations and "allocations are cheap". A recent application I've had to profile shows this exact death-by-a-thousand-cuts phenomenon: so many silly allocations, with the mantra that because it's cheap it's okay. Another problem from these things is that people tend to not care about object lifetimes at all, cos GC. I am conflating these things together, but the cheap allocations, not caring about lifetimes (which leads to more of the allocate-as-you-wish culture) and not having control over memory layout all lead to very sloppy programming.

I went off-heap not because it's cool but because this program just had extremely deterministic object lifetimes and was very easily expressed with arrays of C-like structs and a few hash maps here and there. Now I allocate a bunch of per-thread flyweights that are re-used for everything. The results are massively better. Of course it's not a fair comparison, but the whole "works well enough" attitude is great till it doesn't. To Richard's point, there are some applications which just have extremely complex and varied lifetimes. Such complex programs work really well in Java, because if you had to write them in C/C++ you'd probably get the exact same layouts as Java and end up using smart pointers (which suck) instead of a good GC. But there are a lot of programs (usually infrastructure work without complex business logic) where object lifetime, access patterns etc. are relatively easy to reason about.

More rant: too often have I seen people in Java just allocating like there is no tomorrow and putting it all in a concurrent hash map because they did not stop to think about these details. Java being such a great runtime still ends up giving them way better performance than if they had written the same in some other language, but it's still very sloppy. Actually, I always question the need to write Java using Unsafe. It is way more low-level than expressing the same things in C using structs. When you already have deterministic object lifetimes (i.e. no smart ptrs) and need control over layout, why not use C? Where I work it so happens that no one wants/has experience in writing any C, so I end up with an error-prone, Unsafe-based, assembly-looking pile of code. I can get way better speed using C that looks a lot more normal. Maybe ObjectLayout combined with a great C4-like GC will help here, but I still don't want to write such programs with deterministic lifetimes in a Java-like runtime. I fail to see the benefit besides the whole "lots of Java developers in the world" reason.
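
To make the per-thread flyweight idea concrete, here is a minimal sketch (the 16-byte record layout, field names and sizes are invented for illustration; a real implementation would add bounds checking and a proper schema):

import java.nio.ByteBuffer;

final class RecordFlyweight {
    static final int RECORD_SIZE = 16;          // hypothetical layout: long id, int qty, int price
    private ByteBuffer buffer;
    private int offset;

    RecordFlyweight wrap(ByteBuffer buffer, int index) {
        this.buffer = buffer;
        this.offset = index * RECORD_SIZE;
        return this;                            // the same instance is reused for every record
    }

    long id()    { return buffer.getLong(offset); }
    int  qty()   { return buffer.getInt(offset + 8); }
    int  price() { return buffer.getInt(offset + 12); }

    public static void main(String[] args) {
        ByteBuffer records = ByteBuffer.allocateDirect(1_000 * RECORD_SIZE); // off-heap, C-like array of structs
        records.putLong(0, 42L);
        RecordFlyweight fw = new RecordFlyweight();                          // one flyweight per thread
        System.out.println(fw.wrap(records, 0).id());                        // prints 42
    }
}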

By the way, off-heap isn't the only solution I was referring to when I mentioned other projects hitting gc problems.  Stackoverflow, for example, simply started using structs more (this is .net).  Roslyn team (c# compiler written in c#) started reusing certain objects/pooling and avoiding subtle boxing in hot paths, etc.  Cassandra went off-heap, yes, but they determined that their object lifetime just didn't jive well with how collectors want things to be in order to stay efficient.

My point isn't to be draconian here and not allocate at all (we're not talking about a microcontroller for a space rover here) but for people to stop using the "allocation is cheap" mantra in an effort to dismiss being wasteful.  I'm convinced that this statement has caused more harm than good.  People build libraries and allocate needlessly here and there, thinking "no big deal, these are short-lived, die young, don't get copied, don't contribute to young GC time, and I only do a handful of these, blah blah blah".  Then you put tens of these libraries together to form some app, and bam, each one of those "isolated mindset designs", which aren't terrible if they were the only thing running, now bring the system to its knees or leave perf on the table for no good reason.

Again very well put. More death by a thousand cuts.

 

Gil's mechanical sympathy math is nice analysis, but it seems to paper over the cache implications a bit.  Sure, cores can sustain memory throughput of GB/s but it's not like we're not using those channels for other things - I want to use as much of the resources towards my app logic, and not towards runtime infra/overhead.  Object reuse/pooling doesn't have to be slower than allocation, I'm not sure why that's being made to sound like a fact.  If we're talking about threadsafe pools that service multiple threads, sure, but that's not the only type of pool/use case;  you can reuse objects for a single thread of execution. 

Let's also not forget that if your allocation rate is dependent on load, then you may tune things for today's load and reach satisfactory latency and/or throughput,  but if tomorrow the load changes, you're back to same exercise.  If, however, the memory footprint is constant,  then you don't need to worry about that (you may need to tweak other things in the system for higher load, but that's beside the point and always the case anyway).

Why it is inherently considered less complex to tune GC than to write your application properly is beyond me. If you really want to tune GC you need to know your object lifetimes etc. You still don't get good locality from tuning the GC, but you do need to play around with hundreds of different flags. Or you could do this exercise when you write your application. Not everyone has access to C4, nor is it absolutely zero cost. If I have spare cores I'd rather use them for my application. But to be fair, like Gil said, it might be way more profitable to just write your application quickly (and sloppily) and then pay for a good GC and beefy machines instead of spending potentially more expensive developer hours. This logic does not work for infrastructure pieces (like Cassandra etc.) which run on thousands of machines and need to be efficient: they cannot get all their customers to buy Zing, nor can Zing solve Java's everything-is-a-pointer problem. Nor does it work for phones running on small batteries that are taking over the world.
 

I also wouldn't discount the effects of initiating a GC on your application performance.  Every time it's triggered, all the execution resources are shifted to servicing that.  As I said before, the data and instruction caches will be polluted; branch target buffers will be polluted; u-op or trace caches polluted; you got calls into kernel involving the scheduler; etc etc.  Sure, we're talking about tens/hundreds of millis here but in those tens/hundreds of millis I could've had tens/hundreds of millions of my own instructions executed, per core! Let me reiterate: if you care about consistent sub-millisecond performance, all these things matter and add up.  If your performance requirements are more lax, then maybe you don't.  Luckily, as Gil pointed out, modern machines are flat out beasts so can dampen these types of things.

Where efficiency is a concern (as opposed to just pure performance) and you have deterministic lifetimes, why waste processing power and energy on things you don't need?

Rüdiger Möller

unread,
Oct 31, 2014, 3:57:43 PM10/31/14
to
There is another strong value in going off heap (even when using advanced collectors like Zing's):

Data structures expressed in Java frequently have a redundancy of 80-90% (some math here: http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf).

E.g. by using simple off-heap (mmapped), serialization-based hashmaps, I can fit 7 to 10 times the data into memory. Given that 128/256 GB servers are common, for many if not most applications this means even huge data sets can be kept completely in memory.
I can store >100 million (medium-sized) records in 128 GB of RAM. If I put the same onto the Java heap, this number drops to ~15 million; given that GC needs space to "breathe", it drops even further, to something like 7-10 million records, requiring a JVM size of around 128GB (plus the Azul collector, as a heap of this size cannot be handled by the standard GC).

Edit: the record structure I am referring to is a hashmap-like structure holding 90 attributes, where each attribute is represented by an object ('value class').
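
The plumbing for such an mmapped region is small; a sketch, assuming the application supplies its own serialization format and indexing on top:

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedStoreSketch {
    public static void main(String[] args) throws Exception {
        long regionSize = 1L << 30;             // 1 GB; real stores would map several regions
        try (RandomAccessFile file = new RandomAccessFile("records.dat", "rw");
             FileChannel channel = file.getChannel()) {
            // The mapped memory lives outside the Java heap, so the GC never walks it.
            MappedByteBuffer region = channel.map(FileChannel.MapMode.READ_WRITE, 0, regionSize);
            region.putInt(0, 12345);            // write a serialized field at some offset
            System.out.println(region.getInt(0));
        }
    }
}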

Kirk Pepperdine

unread,
Nov 3, 2014, 7:00:06 AM11/3/14
to mechanica...@googlegroups.com
HI Gil,


These "keep allocation rates down to 640KB/sec" (oh, right, you said 300MB/sec) guidelines are are purely driven by GC pausing behavior. Nothing else.

No, even when GC is working wonderfully I can still make gains in application performance by reducing allocation rates to approximately the levels mentioned. If the limit should be 20+GB/sec as you suggest then I’m just not seeing it in the field as I get disproportionate gains in performance by simply improving allocation efficiency. This is not a case of you are wrong and I am right. And, I don’t fully understand why, I only have ideas..... and I’d like to dig deeper but at the moment all I can tell you is this is what I’m seeing.

Regards,
Kirk



Martin Thompson

unread,
Nov 3, 2014, 7:19:42 AM11/3/14
to mechanica...@googlegroups.com
On 3 November 2014 10:33, Kirk Pepperdine <ki...@kodewerk.com> wrote:
HI Gil,

These "keep allocation rates down to 640KB/sec" (oh, right, you said 300MB/sec) guidelines are are purely driven by GC pausing behavior. Nothing else.

No, even when GC is working wonderfully I can still make gains in application performance by reducing allocation rates to approximately the levels mentioned. If the limit should be 20+GB/sec as you suggest then I’m just not seeing it in the field as I get disproportionate gains in performance by simply improving allocation efficiency. This is not a case of you are wrong and I am right. And, I don’t fully understand why, I only have ideas..... and I’d like to dig deeper but at the moment all I can tell you is this is what I’m seeing.

I can confirm seeing similar things. On machines with higher core counts it seems allocation can become a limiting factor in HotSpot. Amdahl's Law or the USL must be kicking in somewhere. This seems to be amplified if you have allocating lambdas in Java 8. I've not had time to investigate, but I do know that by reducing allocation I can see significant returns, just like Kirk is seeing.

Then we have goodies like this parallel slowdown behaviour in HashMap.


Maybe we need to see more scalability testing on the JVM as our core counts go up.

Gil Tene

unread,
Nov 4, 2014, 2:01:32 AM11/4/14
to <mechanical-sympathy@googlegroups.com>
How about some measurements?

In an empty heap, pure allocation test, running on a 2 year old 2x E5-2690 (16 total physical cores), using 16 concurrently allocating threads, I clocked both HotSpot and Zing doing around 20GB/sec in an 80GB heap. Based on logs Zing spends less than 1% of the overall CPU keeping up with this load (GC is active less than 10% of the time, using a single newgen GC thread, while 16 threads are each active 100% of the time allocating). And jHiccup-observed hiccups remained below 2msec while this was going on. I haven't done the CPU% math for HotSpot, but I'm guessing it's not far off (hiccups remain below 20msec for HotSpot under this load).
[I'd be happy to share the test via GitHub if people are interested in reproducing on other hardware. It's just hard to do from my iPad right now]

The test intentionally minimizes GC work (minimal live set, no retention or promotion) associated with allocation, and the real costs will clearly grow when actual live set and retention costs are incurred on the GC side (and hiccups will dramatically grow for pausing collectors). But the measurements clearly show that allocation is not a limiting factor, or a bottleneck. Both the JVMs and the hardware are able to keep allocating at these rates on a sustained basis.

The way I look to reconcile these measurements with Kirk and Martin's observations is to guess that allocation itself is not a limiting factor in the cases they observe, but that its rate is closely tied to some other operation that *is* a limiting factor, and that the work done to reduce allocation rates also reduced whatever the other (real?) bottleneck is.

Basically, I'm saying that (in these limiting factor cases) allocation is a symptom, and not a cause.

Sent from my iPad

Kirk Pepperdine

unread,
Nov 4, 2014, 2:27:18 AM11/4/14
to mechanica...@googlegroups.com

On Nov 4, 2014, at 8:01 AM, Gil Tene <g...@azulsystems.com> wrote:

> How about some measurements?
>
> In an empty heap, pure allocation test, running on a 2 year old 2x E5-2690 (16 total physical cores), using 16 concurrently allocating threads, I clocked both HotSpot and Zing doing around 20GB/sec in an 80GB heap. Based on logs Zing spends less than 1% of the overall CPU keeping up with this load (GC is active less than 10% of the time, using a single newgen GC thread, while 16 threads are each active 100% of the time allocating). And jHiccup-observed hiccups remained below 2msec while this was going on. I haven't done the CPU% math for HotSpot, but I'm guessing it's not far off (hiccups remain below 20msec for HotSpot under this load).
> [I'd be happy to share the test via GitHub if people are interested in reproducing on other hardware. It's just hard to do from my iPad right now]

I’d like to see the test.

>
> The test intentionally minimizes GC work (minimal live set, no retention or promotion) associated with allocation, and the real costs will clearly grow when actual live set and retention costs are incurred on the GC side (and hiccups will dramatically grow for pausing collectors). But the measurements clearly show that allocation is not a limiting factor, or a bottleneck. Both the JVMs and the hardware are able to keep allocating at these rates on a sustained basis.

This sounds about right. We’re talking GC throughputs that can be greater than 99% so very little retention and promotion.

>
> The way I look to reconcile these measurements with Kirk and Martin's observation is to guess that allocation itself is not a limiting factor in the cases they observe, but that it's rare is closely tied to some other operation that *is* a limiting factor, and that the work done to reduce allocation rates also reduced whatever the other (real?) bottleneck is.
>
> Basically, I'm saying that (in these limiting factor cases) allocation is a symptom, and not a cause.

To this point I completely agree!!! I do not believe that the symptom is the cause. That said, my engagements are driven by symptom, not cause. Frequency does hurt more than size, but size does matter. Allocation == some execution of code, which implies you can see these things with execution profilers... and indeed frequency is quite visible with an execution profiler. However, I always stress looking at this with a memory profiler, as that gives one a view that is more in alignment with the problem and thus provides a better understanding of the cause.

Regards,
Kirk


Jean-Philippe BEMPEL

unread,
Nov 4, 2014, 3:24:34 AM11/4/14
to mechanica...@googlegroups.com
As Peter Lawrey likes to say, allocations thrash your CPU caches. :) Depending on what you are doing with those allocations afterwards, they can have negative side effects like eviction of your read-only data, adding more latency when you access it later; increasing GC frequency also has a negative impact on CPU caches & TLBs (traversing the object graph, moving objects, ...).

Martin Thompson

unread,
Nov 4, 2014, 4:13:00 AM11/4/14
to mechanica...@googlegroups.com
Tracking the issue was not trivial the last time I looked at it. In simple isolated benchmarks, allocation did not seem to be an issue. Then again, having seen things like the parallel slowdown effects in HashMap last year, it could be something like that, but I'd have thought a profiler would have spotted it, like it did with the constructor for HashMap being hot.

This slowdown behaviour seems to show up on more complex applications. JP's hypothesis below is as good as any for the likely cause. What I've seen is a general "slowness" when we have a lot of allocation that is not easy to pinpoint, and the slowness is amplified with thread counts. When allocation is reduced, the speed-up seems to be better than expected. Low-thread-count apps seem not to suffer the effect, from my observations.


Georges Gomes

unread,
Nov 4, 2014, 4:18:55 AM11/4/14
to mechanical-sympathy
More allocation => more minor GCs
More threads => longer time to safepoint? (because all threads need to reach the safepoint) => longer pauses

Combined with JP's comment and maybe something else...



Vitaly Davidovich

unread,
Nov 4, 2014, 9:06:38 AM11/4/14
to mechanica...@googlegroups.com

My take on this is that simple GC benchmarks are just like simple CPU benchmarks: they can give you a false impression of what the costs would be when baked into a more complex application.  For example, a simple benchmark may be able to hit in the cache for its entire run whereas a real application may miss, and an algorithm that's perhaps less CPU-optimized but touches less memory may outperform a CPU-optimized one that touches more memory.

Higher thread count, all else being equal, stresses the OS scheduler more (and threads may end up migrating around, further inducing stalls); you get more frequent young GCs with all their costs outlined several times on this thread, more context switches, TLABs that are possibly smaller and get retired quicker, etc.  In "real" applications, allocations matter! But really, this is no different from other guidelines for performant code: the cheapest code is the code that's not called at all.

Sent from my phone

Gil Tene

unread,
Nov 6, 2014, 5:00:51 AM11/6/14
to mechanica...@googlegroups.com
As promised, the allocation example test is up on github: https://github.com/giltene/GilExamples/blob/master/src/main/java/AllocationRateExample.java

You can configure the number of threads that are allocating full-out with the -c option. I used -c 16 to do 20GB/sec on the 2x E5-2690 as mentioned.

The easiest way to monitor allocation rate is to run with -XX:+PrintGCTimeStamps and -verbosegc (or -Xloggc:gc.log), with a fixed-size newgen, and compute the allocation rate as the amount of memory collected in each newgen cycle divided by the time between newgen GCs.

The example can be run standalone. But you can also make a Java agent out of it (by wrapping it in a JAR with a "Premain-Class: AllocationRateExample" line in the manifest) so that you can just add pure background allocation pressure to any application and see what the effect is on whatever metrics you measure. This should allow you to put real numbers to the theoretical "allocation hurts my speed" models discussed, by seeing the actual disruption costs. When doing that, I'd start with one or two background threads (which will do "only" a handful of GB/sec), and grow from there to see where things start hurting. Obviously, using more threads than there are empty cores (never used by your app, even under load) would be a false test...
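
[For readers who haven't wired up a -javaagent before, the mechanics are roughly as below. This is a hypothetical stand-in, not the actual AllocationRateExample; the class name, option handling and 1KB allocation size are all invented.]

import java.lang.instrument.Instrumentation;

// Jar this up with "Premain-Class: BackgroundAllocator" in the manifest, then run
// your application with -javaagent:background-allocator.jar=2 to add two
// full-out allocating background threads.
public class BackgroundAllocator {
    public static void premain(String args, Instrumentation inst) {
        int threads = (args == null || args.isEmpty()) ? 1 : Integer.parseInt(args);
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                long sink = 0;
                while (true) {
                    byte[] junk = new byte[1024];   // short-lived garbage, dies in newgen
                    sink += junk[junk.length - 1];  // touch it so the allocation isn't trivially dead
                }
            });
            t.setDaemon(true);                      // don't keep the host JVM alive
            t.start();
        }
    }
}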
 
Example of running on my laptop (clocking at 2.5GB/sec on a single thread):

Lumpy.local-58% java -Xmx4g -Xms4g -Xmn2g -verbosegc -XX:+PrintGCTimeStamps -cp . AllocationRateExample -c 1 -t 15000

1.396: [GC (Allocation Failure)  1572864K->632K(3932160K), 0.0011270 secs]

2.083: [GC (Allocation Failure)  1573496K->680K(3932160K), 0.0009779 secs]

2.614: [GC (Allocation Failure)  1573544K->664K(3932160K), 0.0009521 secs]

3.275: [GC (Allocation Failure)  1573528K->664K(3932160K), 0.0007557 secs]

3.883: [GC (Allocation Failure)  1573528K->600K(3932160K), 0.0008720 secs]

4.483: [GC (Allocation Failure)  1573464K->664K(4193280K), 0.0012363 secs]

5.628: [GC (Allocation Failure)  2095768K->708K(4193280K), 0.0009738 secs]

6.423: [GC (Allocation Failure)  2095812K->644K(4193280K), 0.0004019 secs]

7.207: [GC (Allocation Failure)  2095748K->548K(4193280K), 0.0004225 secs]

7.933: [GC (Allocation Failure)  2095652K->580K(4193280K), 0.0006007 secs]

8.670: [GC (Allocation Failure)  2095684K->644K(4193280K), 0.0006047 secs]

9.506: [GC (Allocation Failure)  2095748K->644K(4193280K), 0.0005459 secs]

10.323: [GC (Allocation Failure)  2095748K->580K(4193280K), 0.0005144 secs]

11.130: [GC (Allocation Failure)  2095684K->644K(4193280K), 0.0004345 secs]

11.949: [GC (Allocation Failure)  2095748K->644K(4193280K), 0.0004276 secs]

12.752: [GC (Allocation Failure)  2095748K->548K(4193280K), 0.0004619 secs]

13.569: [GC (Allocation Failure)  2095652K->580K(4193280K), 0.0005234 secs]

14.386: [GC (Allocation Failure)  2095684K->612K(4193280K), 0.0008006 secs]

15.234: [GC (Allocation Failure)  2095716K->580K(4193280K), 0.0005469 secs]

16.031: [GC (Allocation Failure)  2095684K->612K(4193280K), 0.0004742 secs]

Done....
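
[Worked example of the rate calculation described above, using the first two cycles of this log: the heap goes from 632K after the GC at 1.396s to 1573496K just before the GC at 2.083s, i.e. roughly 1.5GB allocated in 0.687 seconds, or about 2.2GB/sec; later cycles, once the newgen has grown to ~2GB, work out to roughly 2.5GB/sec, matching the figure quoted above.]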


Example from a 16 physical core 2x E5-2690 using 16 threads and clocking at ~20GB/sec:

rhine-77% java -Xmx20g -Xms20g -Xmn10g -verbosegc -XX:+PrintGCTimeStamps -cp . AllocationRateExample -c 16 -t 15000

0.421: [GC 7864320K->680K(19660800K), 0.0042570 secs]

0.674: [GC 7865000K->808K(19660800K), 0.0034670 secs]

1.012: [GC 7865128K->840K(19660800K), 0.0035250 secs]

1.326: [GC 7865160K->760K(19660800K), 0.0021870 secs]

1.641: [GC 7865080K->872K(19660800K), 0.0019550 secs]

1.994: [GC 7865192K->680K(20970560K), 0.0026520 secs]

2.418: [GC 10484520K->560K(20969856K), 0.0017370 secs]

2.869: [GC 10484400K->528K(20970368K), 0.0013940 secs]

3.334: [GC 10484112K->496K(20969536K), 0.0017700 secs]

3.794: [GC 10484080K->656K(20970368K), 0.0013320 secs]

4.265: [GC 10484112K->496K(20970368K), 0.0016030 secs]

4.737: [GC 10483952K->528K(20970368K), 0.0016240 secs]

5.206: [GC 10483984K->496K(20970368K), 0.0023400 secs]

5.687: [GC 10483952K->464K(20970432K), 0.0021000 secs]

6.168: [GC 10483984K->592K(20970368K), 0.0016850 secs]

6.644: [GC 10484112K->528K(20970496K), 0.0013200 secs]

7.111: [GC 10484176K->656K(20970432K), 0.0016640 secs]

7.587: [GC 10484304K->464K(20970560K), 0.0015380 secs]

8.064: [GC 10484304K->464K(20970560K), 0.0025490 secs]

8.535: [GC 10484304K->496K(20970624K), 0.0022670 secs]

9.009: [GC 10484400K->656K(20970560K), 0.0026790 secs]

9.488: [GC 10484560K->560K(20970688K), 0.0014580 secs]

9.968: [GC 10484592K->624K(20970624K), 0.0008420 secs]

10.447: [GC 10484656K->624K(20970752K), 0.0015560 secs]

10.929: [GC 10484848K->560K(20970752K), 0.0014460 secs]

11.405: [GC 10484784K->720K(20970816K), 0.0013580 secs]

11.883: [GC 10485072K->560K(20970816K), 0.0015900 secs]

12.355: [GC 10484912K->528K(20970816K), 0.0008100 secs]

12.829: [GC 10484880K->528K(20970816K), 0.0011210 secs]

13.307: [GC 10484880K->496K(20970880K), 0.0025160 secs]

13.784: [GC 10484976K->496K(20970880K), 0.0023670 secs]

14.267: [GC 10484976K->592K(20970944K), 0.0009670 secs]

14.748: [GC 10485136K->592K(20970880K), 0.0016410 secs]

15.224: [GC 10485136K->592K(20970944K), 0.0033210 secs]

15.704: [GC 10485200K->560K(20970944K), 0.0015320 secs]

Done....

Gil Tene

unread,
Nov 6, 2014, 5:16:18 AM11/6/14
to mechanica...@googlegroups.com
The test is focused on the ability to actually allocate at a certain rate, and it clearly shows the rate at which systems actually can allocate. Allocation is a very simple mechanism in all JVMs, and this simple test will experience the exact same mechanisms that any background allocation by any workload will have. The test does not show the impact of allocation within a critical thread on that thread's own performance. It shows the impact of allocation done by other threads within the JVM on a thread that is otherwise unchanged. By doing this, it helps answer the question "is allocation (and not something else) responsible for slowdowns in latency/throughput, and for observed outliers?"

A key benefit of how the test is built (can be used as an agent) is that you can add it to any application and measure the actual impact of increased allocation done by "other threads in the application" on whatever metric measurements you care about. These measurements should be done by each person on the environment of their choice, as a way of answering the "how would adding an additional 2GB/sec impact my application" (which would give you good hints for "how would reducing allocation outside of my critical execution path impact my application").

What I expect to see:

As long as the additional allocating threads are added to otherwise-completely-idle CPUs, and as long as there is plenty of newgen heap for the allocation to live in (I'd recommend using 50-80GB heaps with 20+GB newgens):

1. I expect common case "speed" (measured as e.g. median latency on operations) through the application threads to remain relatively unchanged.

2. [on the various pausing collectors out there] I expect GC pauses and measured hiccups to grow in frequency and in size as allocation rates grow.

3. [on Zing] I expect GC cycles to grow in frequency, but GC pause magnitude and hiccup magnitude to remain entirely unchanged.

The changes that applications may measure in #1 above would be most likely attributed to shared L3 cache pollution (and related cross-eviction) effects. These will differ dramatically by application, and I'm curious to see what people actually see on their own apps. So please post results.

-- Gil.
Vladimir Sitnikov

unread,
Nov 10, 2014, 3:30:10 AM11/10/14
to mechanica...@googlegroups.com
Gil,

Am I right that you would expect
4. [on HotSpot] Higher promotion rates, thus higher Full GC frequency,
thus decrease in "speed"
?

>I'd recommend using 50-80GB heaps with 20+GB newgens
Not everybody can afford that. My notebook is just 16GiB (see example below).

>"is allocation (and not something else) responsible for slowdown in latency/throughput, and for inquired outliers."
Ultimately it would turn out that the DNA of the code author is
responsible for the outliers, while "pure allocation" is fine.

Here's example of "tuning by allocation":
https://github.com/checkstyle/checkstyle/pull/136
Average allocation rate before tuning was 12GB/40sec == 0.3GB/sec.
It's minuscule.
It turns out that just a bit of tuning (e.g. removing non-required
.clone() calls like [2]) cuts the "mvn verify" time in
half: 40 sec -> 20 sec.
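
To illustrate the general shape of that kind of fix (a generic sketch only, not the
actual checkstyle patch; see the linked pull request and [2] for the real change):

final class Lines {
    private final String[] lines;

    Lines(String[] lines) { this.lines = lines; }

    // Before: a defensive copy is allocated on every call, "just in case".
    String[] copyOfLines() { return lines.clone(); }

    // After: read-only callers share the existing array, so the per-call
    // allocation disappears. Only safe when callers are known not to mutate it.
    String[] viewOfLines() { return lines; }
}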

A two-fold reduction of end-to-end time by tuning 0.3GB/sec of
allocations, so your "expectation #1: speed (measured as e.g. median
latency on operations)" seems to fail for this particular checkstyle bug.

I have no idea if Zing (or using 50-80GB) would solve that problem
automagically; however, I believe such requirements are pure overkill.
Checkstyle is used in lots of projects, and you can't just install Zing
everywhere and supply it with a 50GB heap.

[2]: https://github.com/checkstyle/checkstyle/blob/master/src/main/java/com/puppycrawl/tools/checkstyle/api/FileText.java#L289

Vladimir

Gil Tene

unread,
Nov 10, 2014, 12:03:35 PM11/10/14
to mechanica...@googlegroups.com


On Monday, November 10, 2014 12:30:10 AM UTC-8, Vladimir Sitnikov wrote:
Gil,

Am I right that you would expect
4. [on HotSpot] Higher promotion rates, thus higher Full GC frequency,
thus decrease in "speed"
?

Adding the test as an agent to an application should not impact its promotion rates in any significant way, especially if you add enough newgen space to accommodate it (per my recommendation). None of the allocations that the test does will be promoted, and the allocations the application itself does should be no more susceptible [than normal] to promotion with the test running in the background (again, assuming you grow the newgen to accommodate the allocation rate).

So no, I don't expect any additional oldness to appear by adding this test.

BTW, even if they did, that would not affect what I call "speed". "Speed" in the context I'm using it here is the latency of individual transactions when no stalls are occurring. Things like cache pollution and neighbor-caused eviction can affect this "speed", but GC pauses don't count in this regard.

>I'd recommend using 50-80GB heaps with 20+GB newgens
Not everybody can afford that. My notebook is just 16GiB (see example below).

And your notebook also doesn't have 68GB/sec of memory peak bandwidth, I assume ;-). The point was that a cheap commodity server can do 20GB/sec without breaking much of a sweat.

You can scale the test and the heap down to whatever your machine can do. My laptop seems happy at 2.5GB/sec on a single thread, for example, with only a 2GB newgen to keep that going. 
 
>"is allocation (and not something else) responsible for slowdown in latency/throughput, and for inquired outliers."
Ultimately it would turn out that DNA of the code author is
responsible for the outliners, while "pure allocation" is fine.

Here's example of "tuning by allocation":
https://github.com/checkstyle/checkstyle/pull/136
Average allocation rate before tuning was 12GB/40sec == 0.3GB/sec.
It's minuscule.
It turns out that just a bit of tuning (e.g. removing non-required
.clone() calls like [2]) cuts the duration of "mvn verify" time in
half: 40 sec -> 20 sec.

No doubt that removing useless work from the measured path helps. But that's true for anything consuming CPU cycles, and is easily pointed to with a profiler. 

Allocation is not special here. You can replace "allocation" with "floating point operations" and get the same meaning. I.e. if the allocation work itself is responsible for the wasted cycles, it would show up in your profiler just like an unneeded no-allocation matrix-inversion loop would.

Two-fold reduction of a end-to-end by tuning 0.3GB/sec allocations, so
your "expectation #1: speed (measured as e.g. median latency on
operations)" seems to fail for this particular checkstyle bug.

What you describe is more like "a significant improvement in my code that reduced runtime also resulted in a 2-fold reduction in allocation." Your code change managed to save a lot of cycles, some of which happened to be spent in allocation, because some of your waste happened to be in making non-required duplicates of objects that didn't need duplication.
 

I have no idea if Zing (or using 50-80GB) would solve that problem
automagically, however I believe such requirements are pure overkill.
Checkstyle is used in lots of projects and you can't just install Zing
everywhere and supply it with a 50GB heap.

Neither Zing nor an 80GB heap would make an un-needed clone() operation take no time. They also don't make integer divides or Math.sqrt() any faster.

Wasted cycles are wasted cycles.

And allocation that actually has a purpose does not slow things down.
 


[2]: https://github.com/checkstyle/checkstyle/blob/master/src/main/java/com/puppycrawl/tools/checkstyle/api/FileText.java#L289

Vladimir

Nitsan Wakart

unread,
Nov 12, 2014, 6:24:38 PM11/12/14
to mechanica...@googlegroups.com



Two-fold reduction of a end-to-end by tuning 0.3GB/sec allocations, so
your "expectation #1: speed (measured as e.g. median latency on
operations)" seems to fail for this particular checkstyle bug.

What you describe is more like "A significant improvement in my code that reduced runtime also resulted in a 2-fold reducing in allocation." Your code change managed to save a lot of cycles. Some of which happened to be spent in allocation. Because some of your waste happened to be in making non-required duplicates of objects that didn't need duplication.

This sounds like a good test case, if anyone can be bothered to run it. Compare the "mvn checkstyle" run:
1. before the fix
2. after the fix (i.e remove the "clone()" call)
3. with equivalent junk allocation (e.g. "junk = new byte[estimated allocation caused by clone]; blackhole.consume(junk);")
That should give an idea of the pure allocation cost to the test case.
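
A sketch of what step 3 could look like in plain Java, without pulling in JMH's Blackhole (the helper name and the size estimate are made up; the only goal is to keep the junk allocation from being optimized away):

final class Junk {
    static volatile byte sink;                    // crude stand-in for a JMH Blackhole

    static void allocate(int estimatedBytes) {
        byte[] junk = new byte[estimatedBytes];
        sink = junk[junk.length - 1];             // publish one byte so the array isn't provably dead
    }
}

Calling Junk.allocate(...) at each former clone() site, with a size matching the removed copy, would then approximate the original allocation pressure without the copying work.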
