Choosing hardware to minimize latency


Michael Mattoss

unread,
Oct 23, 2014, 4:35:42 PM10/23/14
to mechanica...@googlegroups.com
Hi all,

I wrote an application based around the Disruptor that receives market events and responds accordingly.
I'm trying to decide which hardware configuration to purchase for the server hosting the app in order to minimize the latency between receiving an event and sending a response.
I should mention that the app receives < 1K messages/second so throughput is not much of an issue.

I tried to do some research but the amount of data/conflicting info is overwhelming so I was hoping some of the experts on this group could offer their insights.

How should I choose the right CPU type? Should I go with a Xeon E5/E7 for the large cache, or should I favor a high clock speed CPU like the i7-4790K (4.4GHz) since 99% of the work is done in a single thread?
What about the new Haswell-E CPUs, which seem to strike a good balance between cache size & core speed and also utilize DDR4 memory?
Does it matter whether 16GB of RAM, for example, is configured as 4x4GB or 2x8GB?
Should I use an SSD or a high performance (15K RPM) mechanical drive? (The app runs entirely in memory of course, and the business-logic thread is not I/O bound, but there's a substantial amount of data written sequentially to log files.) How about a combination of the two (SSD for the OS and a mechanical drive for log files)?
Is it worth investing in a high performance NIC such as those offered by Solarflare if OpenOnload (kernel bypass) is not used (just for the benefit of CPU offloading)?

Any help, suggestions and tips you may offer would be greatly appreciated.

Thank you!
Michael

Dan Eloff

unread,
Oct 23, 2014, 5:53:20 PM10/23/14
to mechanica...@googlegroups.com
A couple of things:

Solarflare NICs are really good; CloudFlare tested a bunch of NICs and found them to be without rival: http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-generation-servers/

Haswell-E plus DDR4 is probably the way I'd go: the power savings should eventually pay for themselves, and the higher memory speeds will give a boost to most workloads. Given your single-threaded workload, pick something with a high clock speed. But keep in mind it's usually easy with the Disruptor to pipeline the workload and utilize more cores, e.g. receive, deserialize, preprocess, and postprocess requests on other threads.
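For illustration, a minimal sketch of that kind of pipelining with the Disruptor 3.x DSL (the MarketEvent type, handler bodies, and sizes below are made up; BusySpinWaitStrategy assumes you can afford to dedicate a core per consumer):

```java
import com.lmax.disruptor.BusySpinWaitStrategy;
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.dsl.ProducerType;
import java.util.concurrent.Executors;

public class PipelineSketch {
    // Hypothetical pre-allocated event carrying the raw packet and its decoded form.
    static final class MarketEvent {
        final byte[] rawBytes = new byte[256];
        int rawLength;
        Object decoded;
    }

    public static void main(String[] args) {
        Disruptor<MarketEvent> disruptor = new Disruptor<>(
                MarketEvent::new,                 // all events allocated up front
                1 << 14,                          // ring size (must be a power of two)
                Executors.defaultThreadFactory(),
                ProducerType.SINGLE,              // one network receive thread publishes
                new BusySpinWaitStrategy());      // spin rather than park, for latency

        EventHandler<MarketEvent> deserialize = (event, seq, endOfBatch) -> { /* decode rawBytes */ };
        EventHandler<MarketEvent> process     = (event, seq, endOfBatch) -> { /* business logic  */ };
        EventHandler<MarketEvent> respond     = (event, seq, endOfBatch) -> { /* send response   */ };

        // Three pipelined stages, each on its own thread.
        disruptor.handleEventsWith(deserialize).then(process).then(respond);
        disruptor.start();
    }
}
```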

SSDs are great, especially if you have to fsync data to disk for reliability. However, know your write workload and keep write amplification in mind (small writes that are fsynced will write 128KB, or whatever the erase block size is). In typical workloads SSDs are just so much more reliable, and you really don't want to be dealing with a crashed hard drive at 4am if you can avoid it. The price/GB is good enough for most things these days.

Cheers,
Dan






Gil Tene

unread,
Oct 24, 2014, 1:57:04 AM10/24/14
to mechanica...@googlegroups.com
[Some of this is cut/pasted from emails in which I had answered similar questions]

A few clarification points before answering the questions themselves:

You mentioned that throughput is not the main concern (<1K messages/sec expected), and that you want to minimize the latency between receiving an event and sending a response. But when you say "minimize the latency", do you mean:
a) you want to minimize the best case latency
b) you want to minimize the median latency
c) you want to minimize the worst experienced latency
d) some other level (e.g. minimize the 99.9%'lie)

The answers to some of your questions will vary depending on your answers. And the right answer is probably to have goals for at least three of the points above (b, c, and d at least).

If all you care about is the best case and the median latency, you are probably best off with a single socket, highest clock rate, small core count setup with HT turned off. So what if your latency jumps to 10-20msec every once in a while? You'll have great best case and median numbers to show off, and you said you don't care about high percentiles.

However, if you care at all about your higher percentiles, you'll want to make sure to have enough cores to accommodate the peak number of runnable threads in your system, with some room to spare, and even then you may want to play some taskset, numactl, or isolcpus games. This is where higher core counts, dual socket setups, and turning HT on come into play: they all help keep the high percentiles and outliers down (in both frequency and magnitude).

As for NICs: my experience is that burning a core or two on OpenOnload or the like (spinning with no backoff) is worth a couple of usec in the common case, and that this is especially true for non-saturated workloads (like your <1K/sec), where each latency is measured from a standstill (responding to a single event). If what you care about is latency (and not power, space, or saving a few $), burning those cores is a cheap way to get what you want. Same goes for the Disruptor. Burning cores on controlled & dedicated spinning buys you better latency across the board, as long as you have plenty of cores to burn (it gets really bad when you don't have those cores).

When it comes to comparing clock frequencies, I usually tell people to compare the max turbo boost frequencies, rather than the nominal ones. E.g. an E5-2697 V3 has 14 cores, 35MB of L3, and shows 2.6GHz nominal. An E5-2687W V3 has 10 cores, 25MB of L3, and shows 3.1GHz nominal. And an E5-2667 V3 has 8 cores, 20MB of L3, and shows 3.2GHz nominal. So you may think the 3.2GHz number is best. But if you look carefully, all three chips have the same Turbo Boost frequency (3.6GHz), and chances are that with the same number of cores actually busy, they'll all be running at the same frequency. But the larger chip (E5-2697) has more cores, which can help with outliers and allow for more generous "burning" of cores. And it has a significantly bigger L3, which can help with everything.

When you are latency sensitive, your best hint as an initial filter is power envelope: you should be looking at and choosing between chips in the higher power envelope rankings (135-160W usually). The lower power chips will keep your frequencies lower. [Also note that desktop chips tend to be classified by their turbo boost, as in "up to 4.4GHz", while server chips focus on nominal frequency (2.6GHz rather than "up to 3.6GHz"); that's probably marketing driven.]

For the common single socket vs. dual socket question, and Haswell-E vs. E5, I usually lean towards the E5s. Yes, something like the i7-4790K has a higher clock rate: 4.0GHz nominal, boosting to 4.4GHz, but it only has 4 physical cores, and with the Disruptor and a good NIC stack, you'll be burning those cores down fast, and may have 10-20msec outliers to deal with, causing you to prematurely give up on spinning, and start paying in common case latency for this lack of spinning cores. Another E5 benefit is memory capacity and bandwidth. i7s tend to peak at 32GB, and have only 2 memory channels, while E5s have 4 channels, support faster DRAMs (e.g. 2133 vs. 1600), and a lot more memory (a config of 128GB per socket is commonplace in 2 socket servers today; e.g. that's what the latest EC2 machines seem to be built with).

As for 1 vs. 2 sockets: most 1 vs. 2 socket situations where the 1 socket system "wins" can be matched with numactl and locking the workload (including its RAM) down to one of the two sockets. 2 socket setups have the benefit of allowing you to separate background stuff (like the OS and rarely active processes or threads, and potentially your own non-critical threads) from your critical workload. They do require more care, e.g. with things like IRQ assignment and making sure your packets are processed (in the network stack) in the same L3 that your consuming threads will be on, but in the end that extra capacity can be a lifesaver.

Bottom line:

When people ask me about a "good" system for low latency (and this happens a lot), I answer that if $$ or power are not the drivers, and latency is, my current recommendation is to go with a 2 socket E5 (something like an E5-2697 V3), and 128GB-256GB of memory. The street price for that is somewhere between $8K and $12K ($3K-$4K of which is spent when focusing on the faster, higher core count Xeons). I also recommend investing in a good 10G NIC (like Solarflare), which will add some to that price, but buy you a few more usec.

Martin Thompson

unread,
Oct 24, 2014, 2:53:35 AM10/24/14
to mechanica...@googlegroups.com
On 24 October 2014 06:57, Gil Tene <g...@azulsystems.com> wrote:

+1 A lot of good advice here.

I'd add to the point of getting a dual socket E5. You want to avoid L3 cache pollution from the OS and other processes on your machine. You can do this by using isocups on the second socket and running your app there. The Intel L3 cache is inclusive of L1 and L2, so another core can cause your data/code to be evicted from the L3 and thus your private L1/L2, even when threads are pinned to cores. Classic cold start problem after a quiet period.

You need to strike a balance between having sufficient cores to *always* be able to run a thread when needed and the fact that larger L3 caches tend to have higher latencies; this also needs to consider your working set size in L3. Be careful: a microbenchmark might suggest a smaller L3 is better because the microbenchmark's working set is small, but that changes radically in a real system requiring a larger L3 working set.

I can vouch for the Solarflare network cards too but also consider Mellanox. In some benchmarks I've seen them be a touch faster again.

If money is no object then consider immersion liquid cooling and overclocking with turbo always turned up.

Jean-Philippe BEMPEL

unread,
Oct 24, 2014, 3:52:25 AM10/24/14
to mechanica...@googlegroups.com
+1 with Gil. We are also in the low-latency space with low throughput, and this is exactly what we are doing.



Michael Mattoss

unread,
Oct 24, 2014, 10:44:53 AM10/24/14
to mechanica...@googlegroups.com
Thank you all for your replies and insights!
I would also like to take this opportunity to thank Martin and Gil for developing, and more importantly sharing, the Disruptor and HDRHistogram libraries!

Regarding the latency requirements, I'm mostly concerned with minimizing the 99.9%'lie latency.
Unfortunately, due to budget constraints (somewhere around $5K), a dual socket server doesn't look like a viable option. I'm looking into Xeon E5 v3 and Haswell-E processors and have narrowed down the list of possible candidates, but I need to do more research.

I have a few more questions:
1. What are pros & cons of disabling HT (other than the obvious reduction of logical cores)?
2. Does it make sense to enable HT to increase the total number of available cores, but to isolate some of the physical cores and assign them only 1 thread each, so that the thread does not share its physical core with any other threads?
3. Is there a rule of thumb on how to balance cache size against core speed?

Thank you,
Michael

Gil Tene

unread,
Oct 24, 2014, 11:31:38 AM10/24/14
to mechanica...@googlegroups.com
Budgeting tip: 

Always ask yourself this question: Is saving $5K to $10K in hardware budget worth 1 month of time to market and 1 month of engineering time?

- If the answer is yes, your application isn't worth much. I.e. [by definition] its value is less than ($5K to $10K per month, minus the cost of 1 month of engineering). And your engineers aren't being paid very well.

- If the answer is "That extra $5K in hardware won't buy me that that engineering time and time-to-market" (which is very reasonable for many apps) then what that answer really means your application's probably doesn't care about the 10-20% difference in the lower latency paths that the hardware will cheaply get you, and you can live fine in a $5K two socket server (e.g. E5-2640 V3) with 128GB of memory (that's the true commodity point for servers that are not extremely speed or latency sensitive these days).

If you are going to spend the time to get the low latency parts right, your engineering-hours budget involved in making latency behave well will far outrun any amount of $$ you spend on better commodity hardware. That's true unless you intend to deploy on several 100s of these machines, and even then I'd bet the engineering efforts would cost more than the hardware. You should expect to spend more on engineering than on hardware regardless. But when dealing with low latency, spending an extra $5K-$10K on your hardware to save yourself the time, cost, pain, and duct-tape involved in engineering your workload to fit into a low budget box ALWAYS makes sense.

Jean-Philippe BEMPEL

unread,
Oct 24, 2014, 4:37:23 PM10/24/14
to mechanica...@googlegroups.com
I do not know what your numbers for low latency are or how sensitive you are to them, but for me, at around 100us, it is _vital_ to have 2 sockets, because if I have only one socket my latencies literally _double_ (200us) because of the L3 cache pollution due to administrative threads and other Linux processes...
99.9% is also pretty aggressive, especially if you have low throughput.

just my 2 cents.


Martin Thompson

unread,
Oct 24, 2014, 5:51:56 PM10/24/14
to mechanica...@googlegroups.com
+1


Gil Tene

unread,
Oct 24, 2014, 7:18:00 PM10/24/14
to mechanica...@googlegroups.com
I couldn't resist. On my browser, Google groups shows this for Martin's last message:
 

+1

To unsubscribe from this group and all its topics, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


Steve Morin

unread,
Oct 25, 2014, 11:40:53 AM10/25/14
to mechanica...@googlegroups.com
What's isocups? Saw your reference to it but google didn't come up with much.

Georges Gomes

unread,
Oct 25, 2014, 11:50:34 AM10/25/14
to mechanica...@googlegroups.com
Isolcpus
Isolation of CPUs
Google should give a better answer for that :)

On Sat, Oct 25, 2014, 17:40 Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.

Martin Thompson

unread,
Oct 25, 2014, 11:50:50 AM10/25/14
to mechanica...@googlegroups.com

On 25 October 2014 16:40, Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.

Karlis Zigurs

unread,
Oct 25, 2014, 11:54:49 AM10/25/14
to mechanica...@googlegroups.com
Don't forget to use taskset (or numactl) to launch the process on the 'freed' cores/socket afterwards as well.
Also it may be useful to tune garbage collector threads so that they match the dedicated core count (any real world experiences around this would be interesting to hear).

K


Michael Mattoss

unread,
Oct 25, 2014, 2:24:29 PM10/25/14
to mechanica...@googlegroups.com
Thanks Gil, you raise some good points. I will have to reconsider the hardware budget.
Could you please answer the questions in my previous post regarding enabling/disabling HT and how one weighs cache size against core speed?

Thank you,
Michael

Michael Mattoss

unread,
Oct 25, 2014, 2:39:28 PM10/25/14
to mechanica...@googlegroups.com, jean-p...@bempel.fr
These numbers seem a bit high. At what percentile do you get to 100us latency?
Regarding the 99.9%'lie, I actually think quite the opposite: the fewer messages you have (such as in a low-throughput environment), the greater the impact of responding late to a message.

Michael Mattoss

unread,
Oct 25, 2014, 2:43:41 PM10/25/14
to mechanica...@googlegroups.com
Out of curiosity, is there an equivalent to isolcpus in the Windows world?


On Saturday, October 25, 2014 6:50:50 PM UTC+3, Martin Thompson wrote:
On 25 October 2014 16:40, Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.


Jean-Philippe BEMPEL

unread,
Oct 25, 2014, 3:54:29 PM10/25/14
to mechanica...@googlegroups.com
Michael, 

For HT, the thing is that if the 2 threads share data in L1 and L2 you are fine; otherwise the threads pollute each other, pulling lines from L3.


Jean-Philippe BEMPEL

unread,
Oct 25, 2014, 4:08:03 PM10/25/14
to mechanica...@googlegroups.com
Consider that we get 100us at the 50th percentile.

The thing is that at a low rate the caches are not very warm. This is why thread affinity and isolcpus are mandatory.
If you increase throughput you generally observe better latencies.

For percentiles it is a question of math: it depends on the number of measurements. For us, at 99.9% GC impacts that percentile.

At 99%, as we have a low allocation rate, minor GCs do not impact our measurements (this is also because the measurements include coordinated omission).



Michael Mattoss

unread,
Oct 25, 2014, 6:03:45 PM10/25/14
to mechanica...@googlegroups.com, jean-p...@bempel.fr
Perhaps I'm missing something here, but if you warm up the cache before the system switches to the steady-state phase and you use thread affinity and isolcpus, why would the cache get cold, even when the message rate is low? Does the CPU evict cache lines based on time?
As for GCs, I'm happy to say that I don't need to worry about them. I put a lot of work into designing & implementing my system so it would not allocate any memory during the steady-state phase. Instead, it allocates all the memory it would need upfront (during the initialization phase) and afterwards it just recycles objects through pools.
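As an aside for readers following the thread, here is a minimal sketch of the preallocate-and-recycle approach Michael describes (the Pool class is made up for illustration; it assumes a single owning thread, or hand-off rules established elsewhere):

```java
import java.util.ArrayDeque;
import java.util.function.Supplier;

// All objects are created up front; the steady-state borrow/release path allocates nothing.
final class Pool<T> {
    private final ArrayDeque<T> free;

    Pool(int size, Supplier<T> factory) {
        free = new ArrayDeque<>(size);
        for (int i = 0; i < size; i++) {
            free.push(factory.get());      // allocation happens here, at initialization only
        }
    }

    T borrow() {
        T obj = free.poll();               // LIFO reuse keeps recently used objects cache-warm
        if (obj == null) {
            throw new IllegalStateException("pool exhausted - size it for the worst case");
        }
        return obj;
    }

    void release(T obj) {
        free.push(obj);
    }
}
```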

Martin Thompson

unread,
Oct 25, 2014, 7:12:34 PM10/25/14
to mechanica...@googlegroups.com
On 25 October 2014 23:03, Michael Mattoss <michael...@gmail.com> wrote:
Perhaps I'm missing something here, but if you warm up the cache before the system switches to the steady-state phase and you use thread affinity and isolcpus, why would the cache get cold, even when the message rate is low? Does the CPU evict cache lines based on time?

Intel runs an inclusive cache policy whereby the L3 cache contains all the lines in the private L1s and L2s for the same socket. Therefore if anything runs on that socket then the L3 cache will be impacted. isolcpus and pinning do help but are not perfect: ssh will still run on all cores to gather entropy even when cores have been isolated. You also need to be aware that higher power saving states can evict various buffers and caches, e.g. the branch prediction state, L0 instruction cache, and decode buffers etc. just for the core alone, never mind the L1 and L2 caches.

If you are running a JVM then there are a number of things to watch out for like RMI doing a system GC to support distributed GC whether you need it or not!
 

Gil Tene

unread,
Oct 27, 2014, 1:01:17 AM10/27/14
to mechanica...@googlegroups.com
1. What are pros & cons of disabling HT (other than the obvious reduction of logical cores)?

The con is the loss of logical cores. That one is huge (extremely non-linear), which is why I always recommend people start with HT on, and only try turning it off late in the game, and to very carefully compare the pre/post situation with regard to outliers (i.e. compare multi-hour runs, not those silly 5 minute u-bench things). A "rare" 20msec spike in runnable threads that happens only once per hour can do a lot of damage, and HT on/off *could* make the difference between that spike killing you with outliers and not.

The pro varies a lot. A non-HT core has various "betterness" levels that depend on the core technology (e.g. Haswell vs. Westmere) and on your workload. For resources that are "dynamically partitioned" between HTs on a core, an idle HT has literally no effect. But there are resources that are statically partitioned when HT is on, and halving those can result in various slowdown effects. E.g. halving the first level TLB (as was the case on some earlier cores) will affect you, but the effect would depend on (4K and 2M page level) locality of access. Branch prediction resources, reservation slots, etc. are also statically partitioned in some cores, and similarly could have an effect that varies from 0 to a lot...

Best thing to do is measure. Measure the effect on *your* common case latency, on *your* outliers, and on *your* throughput. [And report back on what you find]  
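As one way to do that measurement, a minimal sketch using HdrHistogram (the bounds and the percentiles printed are arbitrary choices for illustration):

```java
import org.HdrHistogram.Histogram;

public class LatencyRecorder {
    // Track values from 1 ns up to 1 minute, with 3 significant digits of precision.
    private final Histogram histogram = new Histogram(60_000_000_000L, 3);

    public void record(long startNanos, long endNanos) {
        histogram.recordValue(endNanos - startNanos);
    }

    public void report() {
        System.out.printf("median: %d ns, 99%%: %d ns, 99.9%%: %d ns, max: %d ns%n",
                histogram.getValueAtPercentile(50.0),
                histogram.getValueAtPercentile(99.0),
                histogram.getValueAtPercentile(99.9),
                histogram.getMaxValue());
    }
}
```

Run the same multi-hour workload with HT on and with HT off and compare the whole distribution, not just the median.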
 
2. Does it make sense to enable HT to increase the total number of available cores but to isolate some of the physical cores and assigning them only 1 thread so that thread does share the physical core with any other threads?

You can certainly do that, but only if you were already planning to use isolcpus. When you use isolcpus and static thread-to-cpu assignment, keeping neighbors away from specific threads by "burning" 2 HTs for the one thread can be useful. But in a low load system (like yours), the scheduler will tend to keep stuff spread across the physical cores anyway (again, the benefit of having chips with plenty of cores).
 
3. Is there a rule of thumb on how to balance cache size against core speed?

My rule of thumb is that cache is always better than core speed if you have non-i/o related cache misses. This obviously reverses at some point (At an L3 miss rate of 0.00001% I may go with core speed).

The harder tradeoff is cache size vs. L3 latency. Because the L3 sits on a ring in modern Xeons, the larger the L3 is the more hops there are in the ring, and the higher the L3 latency can get to some (random) addresses. It gets more complicated, too. E.g. on the newer Haswell chips, going above a certain number of cores may step you into a longer-latency huge L3 ring, or may force you to partition the chip in two (with 2 shorter latency rings with limited bandwidth and higher latency when crossing rings).

But honestly, details like L3 latency variance between chip sizes are so *way* down in the noise at the point where most people start studying their latency behavior. You may actually get to caring about it eventually, but you probably have 20 bigger fish to fry first.

Martin Thompson

unread,
Oct 27, 2014, 4:06:43 AM10/27/14
to mechanica...@googlegroups.com

On 27 October 2014 05:01, Gil Tene <g...@azulsystems.com> wrote:

3. Is there a rule of thumb on how to balance cache size against core speed?


I've been spending a good chunk of this year building a low-latency system and lately I've been tracking the major causes of latency during the profiling phase. The top three causes of latency if you are working in the 5-50us range are:

1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate (there's a small illustration after this list). And don't get me started on APIs like NIO!!!

2. Resource Contention: You always need to have more cores available than the number of threads that need to run. HT helps somewhat but is really a poor person's alternative to having sufficient real cores. You often need more cores than you think.

3. Wait-Free vs Lock-Free algorithms: When a controlling thread takes an interrupt then other threads involved in the same algorithm cannot make progress. Reworking core algorithms to be wait-free, in addition to lock-free, made a much bigger improvement to the long tail than I expected - even in the case of not having applied isolcpus or thread pinning. We are explicitly designing our system so that it behaves as well as possible on a vanilla distribution, as well as when pinning and other tricks are employed.

These sorts of things matter more than the size of your L3 cache. The difference in latencies between different L3 caches sizes can easily be traded off by just slightly reducing the amount of pointer chasing you do in your code.
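As a small illustration of the lambda point above (the class is made up; whether escape analysis elides a particular allocation depends on the JIT and the calling context, so treat this as a tendency rather than a guarantee):

```java
import java.util.function.LongSupplier;

public class LambdaAllocation {
    static void use(LongSupplier s) { s.getAsLong(); }

    public static void main(String[] args) {
        // Non-capturing lambda: typically compiled to a single cached instance,
        // so evaluating it repeatedly does not allocate.
        LongSupplier constant = () -> 42L;

        for (int i = 0; i < 1_000_000; i++) {
            long start = System.nanoTime();
            // Capturing lambda: it captures 'start', so each evaluation of this
            // expression typically creates a new object - easy to miss on a hot path.
            LongSupplier sinceStart = () -> System.nanoTime() - start;
            use(constant);
            use(sinceStart);
        }
    }
}
```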

Michael Mattoss

unread,
Oct 27, 2014, 5:47:43 PM10/27/14
to mechanica...@googlegroups.com
I would like to thank everyone for sharing their knowledge and insights. Much appreciated!
Hopefully, I'll be able to share some data in a few weeks.

Thanks again,
Michael

Tom Lee

unread,
Oct 28, 2014, 12:52:54 AM10/28/14
to mechanica...@googlegroups.com
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!

Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations (at least on the JVM -- can't speak to native code, but I can imagine that malloc/free could hurt just as much as GC there). Considered pooling briefly in the past, but our working set is often large enough that it basically becomes a choice between potentially exhausting pools or cluttering old gen. Even without pooling, careful reuse of objects between iterations can be fraught with risk as any other allocations on the hot path can quietly shuttle any number of objects over to old gen. It's tough.

Strings are also a particularly painful topic as far as the JVM is concerned: often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

Cheers,
Tom





--
Tom Lee http://tomlee.co / @tglee

Richard Warburton

unread,
Oct 28, 2014, 4:50:36 AM10/28/14
to mechanica...@googlegroups.com
Hi,

Strings are also a particularly painful topic as far as the JVM is concerned: often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

I've proposed a patch to String on core libs which lets you more cheaply encode and decode them from/to ByteBuffer and subsequences of byte[]. It's not accepted yet, but the response seemed to be positive. So hopefully Java 9 will be a bit better in this regard. It would be nice to add similar functionality to StringBuffer/StringBuilder as well.

It would be nice to be able to implement a CharSequence that wraps some underlying source of bytes, but the problem with this kind of approach is that a lot of APIs take String over CharSequence. So it's not just that you end up reimplementing String; you also end up reimplementing a lot of other stuff.
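A minimal sketch of such a flyweight (AsciiSequence is a made-up name, and it assumes single-byte ASCII/Latin-1 text with no validation):

```java
import java.nio.charset.StandardCharsets;

// A CharSequence view over a byte[]; no copy is made, charAt() reads the buffer directly.
final class AsciiSequence implements CharSequence {
    private final byte[] buffer;
    private final int offset;
    private final int length;

    AsciiSequence(byte[] buffer, int offset, int length) {
        this.buffer = buffer;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        return (char) (buffer[offset + index] & 0xFF);
    }

    @Override public CharSequence subSequence(int start, int end) {
        return new AsciiSequence(buffer, offset + start, end - start);
    }

    @Override public String toString() {
        // Only pay for a String (and its copy) when one is really required.
        return new String(buffer, offset, length, StandardCharsets.ISO_8859_1);
    }
}
```

And as noted, the moment an API insists on String rather than CharSequence, the copy comes back.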

The question always arises in my head: "Why are you using Strings?" If it's because you want a human-readable data format, then using a binary encoding which has a program that lets you pretty-print the encoding is just as good IMO, and avoids a lot of these issues. If you're interacting with an external protocol which is text based, I can appreciate that this isn't a decision you can make so easily.

Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

One of the specific on-heap NIO related allocation hot-spots seems to be in Selectors, which have to update an internal HashSet inside select(). If you're processing a lot of selects then this can be problematic.

regards,

  Richard Warburton

Martin Thompson

unread,
Oct 28, 2014, 5:38:32 AM10/28/14
to mechanica...@googlegroups.com
On 28 October 2014 04:52, Tom Lee <m...@tomlee.co> wrote:
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!

Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations (at least on the JVM -- can't speak to native code, but I can imagine that malloc/free could hurt just as much as GC there). Considered pooling briefly in the past, but our working set is often large enough that it basically becomes a choice between potentially exhausting pools or cluttering old gen. Even without pooling, careful reuse of objects between iterations can be fraught with risk as any other allocations on the hot path can quietly shuttle any number of objects over to old gen. It's tough.

The summary is that if you are working in the 10s of microsecond space then any allocation will result in significant outliers regardless of JVM. Anyone in HFT is usually well aware of this issue.

However the picture gets more interesting for general purpose applications. I get to profile quite a few applications in the field across a number of domains. The thing this has taught me is that the biggest performance improvements often come from simply reducing the allocation rate. A little bit of allocation profiling and some minor code changes can give big returns. I recommend to all my customers that they run a profiler regularly and keep allocation to modest levels, regardless of application type, and the returns are significant for minimal effort. Our current garbage collectors seem to be not up to the job of coping with the new multicore world and large memory servers - Zing excluded :-)

Allocation is just as big an issue in the C/C++ world. In all worlds it is the reclamation rather than the allocation that is the issue. Just try allocating memory on one thread and freeing it on another in a native language and see how that performs. The big benefit we get in the native world is stack allocation. This has so many benefits besides cheap allocation and reclamation: it is also local in the hot OS page and does not suffer false sharing issues. In the multicore world I so miss stack allocation when using Java. It does not have to be a language feature; Escape Analysis could be better, or an alternative like the object explosion JRockit had could help.

Pools can be a useful technique at times. They are especially a big win when handing things back and forth between two threads. However pooling should be used with extreme caution and guided by measurement. Objects in pools need to be considered immortal.
 
Strings are also a particularly painful topic as far as the JVM is concerned: often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

Not as strange as you would think. Besides the proliferation of XML and JSON encodings we have to deal with text even on some of the highest performance systems. Most financial exchanges still use the tag value encoded FIX protocol. Thankfully many exchanges are now moving to binary encodings but it will take time for the majority to migrate.

Codecs burn more CPU cycles than anything else in most applications. We need better examples for people to copy and the right primitives for building parsers, especially in Java. We need simple things like being able to go between ByteBuffer and String without intermediate copies that go to and from byte[]s, e.g. We need a constructor something like String(ByteBuffer bb, Charset cs) and a String method int getBytes(ByteBuffer bb, Charset cs).

We also need to be able to deal with text encoded numbers, dates, times, etc. in ByteBuffer and byte[] without allocation and copying. Things like writeInt(int value) as ASCII or UTF-8 to and from ByteBuffer or byte[]. This way the bounds checking can be done once, fewer reference indirections are taken per operation, and allocation is totally avoided.
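A minimal sketch of that kind of primitive (AsciiCodec is a made-up name; it handles only non-negative values and does no bounds or digit validation):

```java
import java.nio.ByteBuffer;

final class AsciiCodec {
    // Append a non-negative int to the buffer as ASCII digits, without creating
    // a String or a temporary byte[].
    static void putInt(ByteBuffer buffer, int value) {
        if (value == 0) {
            buffer.put((byte) '0');
            return;
        }
        int start = buffer.position();
        while (value > 0) {
            buffer.put((byte) ('0' + (value % 10)));
            value /= 10;
        }
        // Digits were written lowest-order first; reverse them in place.
        for (int i = start, j = buffer.position() - 1; i < j; i++, j--) {
            byte tmp = buffer.get(i);
            buffer.put(i, buffer.get(j));
            buffer.put(j, tmp);
        }
    }

    // Parse ASCII digits in [offset, end) as a non-negative int.
    static int getInt(ByteBuffer buffer, int offset, int end) {
        int value = 0;
        for (int i = offset; i < end; i++) {
            value = value * 10 + (buffer.get(i) - '0');
        }
        return value;
    }
}
```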

When I build parsers I often write my own String handling classes and design APIs so they are composable. That is, have the lower level APIs trade complexity for efficiency and allow them to be wrapped/composed with APIs that are idiomatic or easier to use. This way the end user can have a choice rather than treating them like children who cannot make choices. On a quick scan of the proposed JSON API for Java 9, none of the important lessons seem to have been learned.
 
Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

I'd be here all day if you got me started on NIO :-(
 
Just take the simple example of Selector. You do a selectNow() and then you get a selected key set that you must iterate over. Why not just have selectNow() take a callback, or pass in a collection to fill? This, and the likes of String.split(), are examples of brain-dead API design that causes performance issues. Richard and I have worked on the likes of this, and he has at least listened to my whines and is trying to do something about it.
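For anyone who hasn't fought with it, this is the shape of the API being complained about: the standard select loop (nothing here is a fix; it just shows where the per-pass set mutation and iterator churn come from):

```java
import java.io.IOException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.util.Iterator;

final class SelectLoop {
    // Every pass iterates (and mutates) the Selector's internal selected-key HashSet.
    static void poll(Selector selector) throws IOException {
        if (selector.selectNow() == 0) {
            return;
        }
        Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
        while (keys.hasNext()) {
            SelectionKey key = keys.next();
            keys.remove();                  // must remove, or the key is reported again
            if (key.isValid() && key.isReadable()) {
                // read from key.channel() ...
            }
        }
    }
}
```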

Vitaly Davidovich

unread,
Oct 28, 2014, 1:16:15 PM10/28/14
to mechanica...@googlegroups.com

+1.  It doesn't help that the mainstream java community continues to endorse and promote the "allocations are cheap" myth.

Sent from my phone


Rajiv Kurian

unread,
Oct 28, 2014, 1:30:32 PM10/28/14
to mechanica...@googlegroups.com


On Tuesday, October 28, 2014 2:38:32 AM UTC-7, Martin Thompson wrote:


Allocation is just as big an issue in the C/C++ world. In all worlds it is the reclamation rather than the allocation that is the issue. Just try allocating memory on one thread and freeing it on another in a native language and see how that performs. The big benefit we get in the native world is stack allocation. This has so many benefits besides cheap allocation and reclamation, it is also local in the hot OS page and does not suffer false sharing issues. In the multicore world I so miss stack allocation when using Java. It does not have to be a language feature, Escape Analysis could be better or an alternative like object explosion JRockit can help.
Of course, in the C/C++ world writing custom allocators is one of the first things to do in a high performance project. Culturally speaking it is more acceptable too. If you need to allocate memory on one thread and free it on another, you usually allocate the memory, pass a const view into that buffer via a queue/ring-buffer to the other thread, and when it is done it passes a message back to the owning thread, which can then put the memory back.

Simple conventions like using a circular buffer to directly emplace objects (you have to come up with some header scheme if you want different kinds of objects) work really well, especially with respect to fragmentation. On the receiving thread you look at the header, cast to a struct, and process it. Again, having first class structs makes this kind of code much more manageable compared to writing error prone flyweights in Java over ByteBuffers or unsafe blobs. Of course, the processing logic becomes a big switch statement on object types, which might not be the best for icache performance. In projects with very few types of objects one could have a circular buffer for each object type, which leads to better icache utilization.

With languages like Rust, with lifetime analysis, you could go one step further and make sure that pointers to these structs (with lifetime equal to a stack allocated variable) are not leaked into heap allocated memory, which could lead to a world of trouble (in Java or C/C++). I think for anything high performance, writing a specific allocator is a big win, and C/C++ have a huge advantage here because stack allocation/automatic variables happen to be one such custom allocator that is highly usable in most projects. Any generic malloc/free implementation, even with thread local caches and cross thread malloc/free, has very few guarantees about the amount and nature of fragmentation. The JVM might have a great allocator, but high performance projects cannot count on any generic allocator, even one as advanced as the JVM's. I am really excited about Rust because it encourages people to think about object lifetimes, layout, and their relationship over the course of a program. GCed languages remove this concern and with it the ability to reason about memory layout.
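A minimal sketch of the flyweight-over-ByteBuffer pattern being referred to (OrderFlyweight and its field layout are invented for illustration; a real codec would also deal with byte order, versioning, and bounds):

```java
import java.nio.ByteBuffer;

// Hypothetical fixed layout:
//   bytes 0-3   : messageType (int)
//   bytes 4-11  : price       (long)
//   bytes 12-19 : quantity    (long)
// The same flyweight instance is re-pointed at different offsets, so reading
// or writing a message allocates nothing.
final class OrderFlyweight {
    private ByteBuffer buffer;
    private int offset;

    OrderFlyweight wrap(ByteBuffer buffer, int offset) {
        this.buffer = buffer;
        this.offset = offset;
        return this;
    }

    int messageType() { return buffer.getInt(offset); }
    long price()      { return buffer.getLong(offset + 4); }
    long quantity()   { return buffer.getLong(offset + 12); }

    OrderFlyweight messageType(int value) { buffer.putInt(offset, value); return this; }
    OrderFlyweight price(long value)      { buffer.putLong(offset + 4, value); return this; }
    OrderFlyweight quantity(long value)   { buffer.putLong(offset + 12, value); return this; }
}
```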

Todd Montgomery

unread,
Oct 28, 2014, 1:51:16 PM10/28/14
to mechanica...@googlegroups.com
BTW, I'll be talking at QCon SF next week about the lessons that Martin, Richard, and I have had with Java 8 with the project that Martin has mentioned.

The primary allocation problem with Selector is, as Martin mentioned, the selectedKeySet that is in use internally. It is possible to avoid the allocations within the HashSet by using your own set. This is the approach that netty has taken, for example. But the API is also close... but not quite. Simple changes would make NIO so much better. There are also some DatagramChannel specific annoyances to deal with. The, IMHO, unneeded locking built into channels and Selector is quite disappointing for someone used to the native networking world to consider an acceptable design. Which leads to optimizations like finding ways to NOT use Selector unless the number of channels reaches a certain threshold, etc.

For C/C++, there are some additional things to consider. Freeing on a different thread than the one that allocated is a concern that can be handled marginally well via design. Allocators often work very poorly with some data structures due to fragmentation. The good thing is that most mechanically sympathetic data structures avoid that naturally. As Martin cautions, pooling can be quite useful or a total nightmare when done badly. I must say that as a general rule for C/C++, I tend to: (1) use the stack for copy-on-write/copy-on-save, and (2) use single threaded designs with ARC. I consider usage of std::shared_ptr to often be a smell.

-- Todd

Eric Wong

unread,
Oct 28, 2014, 2:28:26 PM10/28/14
to mechanica...@googlegroups.com
Martin Thompson <mjp...@gmail.com> wrote:
> Allocation is just as big an issue in the C/C++ world. In all worlds it is
> the reclamation rather than the allocation that is the issue. Just try
> allocating memory on one thread and freeing it on another in a native
> language and see how that performs. The big benefit we get in the native

Depends on the malloc implementation, of course. The GPLv3
locklessinc.com malloc handles it fine (as does my experimental
dlmalloc+wfcqueue fork[1])

I think the propagation of things like Userspace-RCU (where
cross-thread malloc/free is a common pattern) will encourage
malloc implementations to treat this pattern better.

[1] git clone git://80x24.org/femalloc || http://femalloc.80x24.org/README

Martin Thompson

unread,
Oct 28, 2014, 2:56:59 PM10/28/14
to mechanica...@googlegroups.com
Does this implementation need to use a CAS to free the memory when returning it? If not, how does it avoid it? 

Eric Wong

unread,
Oct 28, 2014, 3:12:34 PM10/28/14
to mechanica...@googlegroups.com
No CAS. In femalloc, free via wfcqueue append only uses xchg.

git clone git://git.lttng.org/userspace-rcu.git
See ___cds_wfcq_append in urcu/static/wfcqueue.h

The wait-free free cost is shifted to allocation when it reads from the
queue, which may block if the free-ing thread is in the middle of an
append (I think the blocking is unlikely, but I haven't tested
this enough...)

locklessinc uses xchg in a loop to implement hazard pointers
and a lock-free (but not wait-free) stack. It can loop inside the
free-ing thread, but not the allocating thread (I'm less familiar
with this one).
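As a rough Java analogue of that append path, here is the producer side of a Vyukov-style intrusive MPSC queue, where the only atomic on enqueue is getAndSet (typically a single XCHG on x86) rather than a CAS retry loop (MpscQueue is a sketch for illustration and, unlike a pooling allocator, it allocates a node per offer):

```java
import java.util.concurrent.atomic.AtomicReference;

final class MpscQueue<T> {
    static final class Node<T> {
        final T value;
        volatile Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> tail;
    private Node<T> head;   // only touched by the single consumer

    MpscQueue() {
        Node<T> stub = new Node<>(null);
        tail = new AtomicReference<>(stub);
        head = stub;
    }

    // Wait-free for any number of producers: one XCHG, no retry loop.
    void offer(T value) {
        Node<T> node = new Node<>(value);
        Node<T> prev = tail.getAndSet(node);   // swap in the new tail
        prev.next = node;                      // then link the old tail to it
    }

    // Single consumer only. It may briefly see an unlinked node (prev.next not yet
    // written) and report empty; that is the cost shifted onto the reading side.
    T poll() {
        Node<T> next = head.next;
        if (next == null) {
            return null;
        }
        head = next;
        return next.value;
    }
}
```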

Rüdiger Möller

unread,
Oct 28, 2014, 3:15:40 PM10/28/14
to mechanica...@googlegroups.com

Am Dienstag, 28. Oktober 2014 10:38:32 UTC+1 schrieb Martin Thompson:
Just take the simple example of Selector. You do a selectNow() and then you get a selected key set that you must iterate over. Why not just have selectNow() take a callback, or pass in a collection to fill? This and the likes of String.split() are examples of brain-dead API design that causes performance issues. Richard and I have worked on the likes of this, and he has at least listened to my whines and is trying to do something about it.

:)) I feel your pain .. 90% of Java APIs/libraries look that way. It's "great" when JEE standard APIs enforce object creation by design, such that the quality of the concrete implementation doesn't make a difference anymore.

Rüdiger Möller

unread,
Oct 28, 2014, 3:17:34 PM10/28/14
to mechanica...@googlegroups.com


Am Dienstag, 28. Oktober 2014 09:50:36 UTC+1 schrieb Richard Warburton:

The question always arises in my head - "Why are you using Strings?" If it's because you want a human-readable data format, then using a binary encoding with a program that lets you pretty-print the encoding is just as good IMO and avoids a lot of these issues. If you're interacting with an external protocol which is text-based, I can appreciate that this isn't a decision you can make so easily.


Yep, the internet is running in debug mode. Part of the problem is mixing behaviour and encoding when defining a protocol. That's common, unfortunately.


Martin Thompson

unread,
Oct 28, 2014, 3:18:37 PM10/28/14
to mechanica...@googlegroups.com
OK, I understand. This is as fast as things can get when contended. However, it is still fundamentally contended: the XCHG, while not a CAS, is still a fully fenced instruction that must wait for the store buffer to drain before making progress. You don't want to do this too often in the middle of a high-performance algorithm.
 

Gil Tene

unread,
Oct 28, 2014, 3:52:50 PM10/28/14
to mechanica...@googlegroups.com
<< Putting on my contrarian "but allocations ARE cheap" hat. >>

I disagree, guys. This notion that "allocation causes outliers" is bogus. Pauses and glitches are not an inherent side-effect of allocation; they are side effects of bad allocator implementations.

The notion that "allocation slows things down" is also misleading. I.e., pooling is not an alternative to allocation. Pooling *IS* allocation, in a different, slower, but maybe-better-in-outliers implementation. Pooling is usually *slower* (not faster) than straight-line allocation.

I'm in no way advocating allocation-for-no-good-reason. From a common-case speed perspective, efficient code that can completely avoid bringing fresh lines into the cache and pushing stale ones out will obviously tend to be faster (in the common case) than code that moves many such cache lines in and out. I'm not arguing otherwise, but that's ALL there is to the fundamental cost of allocation.

And that cost is often well hidden, and can approach zero. E.g. a good streaming allocator (as in 2MB TLABs) has virtually perfect access pattern predictability, so the incoming latency cost does not really exist (assuming the allocation pipeline works right).
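[Editor's note] As a rough mental model of why such a streaming allocator is so predictable, here is a schematic sketch in Java. This is not JVM code; the names are invented, and the slow path is stubbed out:

    // Schematic model of TLAB-style bump-pointer allocation: the common case is a
    // bounds check and a pointer bump over a sequential region, which is why the
    // memory access pattern is so predictable and prefetch-friendly.
    final class TlabModel {
        private long top;   // next free address within the thread-local buffer
        private long end;   // end of the current buffer

        long allocate(long size) {
            long obj = top;
            if (obj + size > end) {
                refill();              // slow path: grab a fresh buffer from the shared heap
                obj = top;
            }
            top = obj + size;          // fast path: a single bump, no locks, no CAS
            return obj;
        }

        private void refill() {
            top = requestNewBuffer();  // hypothetical call into the shared allocator
            end = top + (2L << 20);    // e.g. a 2MB TLAB, as mentioned above
        }

        private long requestNewBuffer() { return 0; } // stub for the sketch
    }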

And it's true that even such efficient allocation patterns will still evict other (cold) lines from the various cache hierarchies (L1, L2, L3, and other cores' L2 and L1 in turn, due to inclusiveness). But it's mostly cold lines that are susceptible to being pushed out of the L3 (and neighboring L2/L1). Hot lines that are part of your active working set will stay in the cache, due to the cache's LRU and LRU-approximating policies.

So where does that big bugaboo come from? That thing that makes allocation behave badly, and that causes smart people with real experience to work so hard to avoid it? 

The answer can often be found in a forced choice of runtime and platform, one that has terrible allocation side-effects. The hint is in statements like "...we wanted to make it work well on vanilla <runtime> implementations". The real cost and issue with allocation is in how it is handled in those vanilla implementations. It's simple: if you use a crap allocator, you'll get crap allocation-related performance artifacts. But conversely, if you use an allocator that supports the qualities you want (e.g. minimize allocation-related outlier artifacts and side-effects), that doesn't have to be the case. 

If the code you are building is *not* intended to be used by the world-as-a-whole on generic vanilla runtime, and you are trying to get stuff done and out to market quickly, using a good allocator (instead of engineering all allocations away) can save you tons of engineering time, efforts, money and headaches. The easiest place this pays off is in not having to write your own zero-allocation code for everything, and in not having to force your developers to only use "code-written-in-this-building". Leverage == money, time, etc. etc. etc.

With all that said, it's true that in Java, even with the best collector on the market (guess which one I'm talking about? Yup, the one that's >1000x better than all others in outlier behavior...), allocation will eventually lead to outlier glitches that range into the hundreds of microseconds (yes, that's micro, not milli). And with that happening once every several seconds (say, at a 1GB/sec sustained allocation rate in a 100GB heap), your allocation-related 99.999%'lie may creep up to hundreds of usec as a result. And it's true that with no allocation whatsoever done in your process, those outlier artifacts *could* have been completely avoided for threads that sit on dedicated isolcpus cores... but normally, those artifacts are so far below the levels of other, much bigger outlier causes that you won't be able to measure them.

The reality is that those situations (threads on dedicated isolcpus cores) are the ONLY case in which you would be able to measure the outlier effects of allocation when a good allocator and a good collector are involved. In all the situations I know of (e.g. not using isolcpus, or having other threads that allocate in the same process), the noise generated by OS-level artifacts (scheduling, interrupts, paging, and other on-demand funny stuff) presents far bigger outliers, and far more frequent ones, than those that are caused by allocation.

For a practical example of how happy a good allocator (and associated background collector) can make people, and how much time they can save in avoiding the "workaround bugaboo" here is a link to a posting that really made my day yesterday: http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2014-October/002045.html . We get a lot of similar feedback from people running low latency code on our platform, but most of those don't get to post publicly about it...

-- Gil.



Todd Montgomery

unread,
Oct 28, 2014, 4:38:42 PM10/28/14
to mechanica...@googlegroups.com
There is an additional sentiment that I wish I understood. It seems that many people believe that parsing is easier with ASCII.

-- Todd

Martin Thompson

unread,
Oct 28, 2014, 4:50:54 PM10/28/14
to mechanica...@googlegroups.com
On 28 October 2014 19:52, Gil Tene <g...@azulsystems.com> wrote:
<< Putting on my contrarian "but allocations ARE cheap" hat. >>

I disagree, guys. This notion that "allocation causes outliers" is bogus. Pauses and glitches are not an inherent side-effect of allocation; they are side effects of bad allocator implementations.

For clarity, by "allocation" we really need to be talking about object lifecycle management. Allocation is such a small part of the domain.

Most GC implementations - including C4, I believe - are based on the weak generational hypothesis, i.e. most objects die young. Have you measured an application making significant use of persistent data structures, such as FP-style tries? They allocate a lot, and the objects typically live medium-term. Our current generation of garbage collectors really sucks at dealing with that type of lifecycle.
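[Editor's note] As a concrete illustration of why this lifecycle is awkward for generational collectors, here is a minimal persistent-list sketch (invented for illustration, not any particular library): each "update" copies the path to the changed element, and those copies stay live for as long as any older version of the structure is retained.

    // Replacing element i copies the first i+1 nodes; older versions keep the old
    // nodes alive, so heavy update traffic produces medium-lived garbage rather
    // than the "die young" garbage the generational hypothesis expects.
    final class PList<T> {
        final T head;
        final PList<T> tail;

        PList(T head, PList<T> tail) { this.head = head; this.tail = tail; }

        static <T> PList<T> cons(T head, PList<T> tail) { return new PList<>(head, tail); }

        // No bounds checking, for brevity.
        PList<T> with(int index, T value) {
            if (index == 0) return new PList<>(value, tail);
            return new PList<>(head, tail.with(index - 1, value));
        }
    }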

I think as the years go on this is going to become a really hot topic. Persistent data structures are really nice to use and reason about, but they have serious implementation issues that need to be considered. For me this is a fascinating topic that deserves serious research.

Martin...

Todd Montgomery

unread,
Oct 28, 2014, 4:54:59 PM10/28/14
to mechanica...@googlegroups.com
Our favorite neighborhood allocator is a tremendous revolutionary step up from other allocators. I can't say enough good about it.

However, a lot of the time the need for so much allocation is unfounded. For the case of Selector, it seems quite shortsighted to consider HashSet a prudent thing to _lock_ in and NOT allow it to be the caller's choice. This isn't just a performance issue; it seems to be an API issue as well. Especially with lambdas being slapped in (I mean, so prevalent) throughout the Java 8 API. But that is just an example. In many places in the JDK there are baffling choices made wrt object ownership, creation, etc.

I don't think it is really just the desire to have things work on other runtimes. It's somewhat of a code style, API, and usability thing as well.

-- Todd


Gil Tene

unread,
Oct 28, 2014, 6:14:22 PM10/28/14
to mechanica...@googlegroups.com
Just because we can leverage a cool trick to be 10x-20x more efficient at some workload doesn't mean that the other sort is "bad" ;-)

C4 does leverage the weak generational hypothesis for efficiency, but not for mutator speed or for lack of outliers. The algorithm uses the exact same concurrent Mark-Compact mechanism for newgen and oldgen, which makes it work great even for workloads that exhibit significant churn in long- and medium-lived objects (as caches, moving-time-window workloads, and many large in-memory analytic workloads do).

To be clear about what I mean by efficiency: I mean "amount of background CPU cycles spent per unit of allocation". This efficiency effect is pretty dramatic, usually 10x-20x fewer background GC cycles spent compared to single-generation runs. So if we were to spend, say, 0.5% of the system's available CPU cycles on GC in a given generational-friendly workload, that would grow to 5-10% if the workload were completely "adversarial" and the newgen filter was not getting us the efficiency it normally does. [And yes, these are typical numbers for latency-sensitive apps, even when allocating multiple GB/sec.]

Note that this math relates to background work and cpu cycles spent outside of your latency path. This is work that other cores (on potentially other sockets) are doing, without slowing down your latency sensitive threads or causing them any extra outliers.

Luckily, there is a way to buy back even this lost efficiency with an algorithm like C4 (whose cost is purely linear in the live set): for every doubling of empty memory in the heap, C4's efficiency doubles, and this holds even for adversarial, non-generational-friendly workloads.

A different way to think about it is this: "all" the weak generational hypothesis buys us in Zing is the ability to keep up with the same workload (live set, allocation rate, promotion rate) in a smaller heap. But we can gain the same benefit by throwing cheap, empty memory at the problem. E.g. growing your empty heap from 5GB to 50-100GB will give the same 10x-20x efficiency boost. And these days, an extra 100GB means no more than an extra $1700.

Gil Tene

unread,
Oct 28, 2014, 6:20:54 PM10/28/14
to mechanica...@googlegroups.com
+1.

Slow, inefficient code is slow and inefficient... And allocation for no good reason is plain waste.

But there is a huge difference between optimizing code, which often includes removing *wasteful* allocation from hot paths, and going for a "thou shalt not allocate anything" approach in the hot path, or in the entire process. 

The "thou shalt not allocate" approaches always normally driven by outliers, not by speed or the need to optimize critical paths. Anything in the process that allocates (critical path or not) will cause an eventual GC to occur, and if the GC you use causes pain, your predictable reaction to that pain would be "thou shalt not allocate"...

But when that pain isn't there, allocation stops being a sin. It's still not something you should do for no good reason, and you should still be optimizing your *hot* paths and your latency-critical paths, but allocation is virtually always faster than pooling, for example, and when one of the two is needed on a latency-critical path, allocation usually results in lower latency.


Todd Montgomery

unread,
Oct 28, 2014, 6:45:57 PM10/28/14
to mechanica...@googlegroups.com
I look at "thou shalt not allocate anything in the process" as unrealistic in a large system. Someone has to allocate if any state is kept at all. It's like "no side effects" as FP proponents point out. No state nor side effects? Well, that is a pretty nice, nearly useless box...

However, the "thou shalt not allocate anything in the data path" is just as much a cleanliness and efficiency point. I've done this even in C. A lot of times this is due to not wanting to leak state or leak memory or prevent leaky abstractions. But I think there is something to be said for making sure that state lifetime and object lifetime are tightly controlled and well thought out. For me, this usually leads to cleaner and simpler designs. It's so often overlooked.

One thing that concerns me with Java is that it is insanely easy not to care about object lifetime. Maybe this is good. But it does concern me, as it is not possible to be totally ignorant of memory usage, or else you eventually hold onto everything and OOM. But I do agree that we don't need to hold onto memory and hoard it as we would something like FDs, mmaps, etc. It's not THAT precious; we can make some very good, enlightened tradeoffs. Trading a little more memory for more efficiency is a fundamentally good one.

It's too bad that often as a discipline we look at extremes way too much. All the interesting stuff is in the shades of grey zones....

-- Todd


Kirk Pepperdine

unread,
Oct 29, 2014, 3:27:00 AM10/29/14
to mechanica...@googlegroups.com
"The internet running in debug mode" is one of the most brilliant statements on the state of logging today that I've ever heard. I want to quote you!!!
All too common.. in fact I'd say that most people want a number of things and it's all rolled up into one thing that doesn't work well for anything… If it's baked into a protocol then it's almost impossible to untangle.

Regards,
Kirk

Todd Montgomery

unread,
Oct 29, 2014, 11:29:24 AM10/29/14
to mechanica...@googlegroups.com
Unless you decide to throw out the protocol and start over again, a la HTTP/2.

-- Todd

Kirk Pepperdine

unread,
Oct 30, 2014, 3:28:54 AM10/30/14
to mechanica...@googlegroups.com
Martin,

I read one of your earlier posts about what allocations cost, and I have to say it is completely consistent with my experience as well. My current threshold for starting to look at memory efficiency as a way of improving performance is ~300-400MB/sec. Yeah, that's it!!! If allocation rates exceed that, then I know I can get significant gains in performance by working on memory efficiency. You can't tune this out by turning GC knobs on the JVM. You need to find out who's allocating and kill it.

I hear what Gil is saying, and as usual it's difficult to argue against his deep knowledge of how things work, but then… every time I've seen a team exhaust the gains they can get with an execution profiler, I'll look at the allocation churn, and if it's higher than that threshold I know I'm going to look like a genius in that engagement (even though we all know the truth ;-))

Regards,
Kirk


Mamontov Ivan

unread,
Oct 30, 2014, 7:59:17 AM10/30/14
to mechanica...@googlegroups.com
What about true RISC processors, for example a POWER8, with a faster processor, more cache, and more memory?

On Friday, October 24, 2014 at 8:57:04 UTC+3, Gil Tene wrote:
[Some of this is cut/pasted from emails in which I had answered similar questions]

A few clarification points before you answer the question themselves:

You mentioned that throughput is not the main concern (<1K messages/sec expected), and that you want to minimize the latency between receiving an event and sending a response. But when you say "minimize the latency", do you mean:
a) you want to minimize the best case latency
b) you want to minimize the median latency
c) you want to minimize the worst experienced latency
d) some other level (e.g. minimize the 99.9%'lie)

The answers to some of your questions will vary depending on which of these you pick. And the right answer is probably to have goals for at least three of the points above (b, c, and d at least).

If all you care about is the best case and the median latency, you are probably best off with a single-socket, highest-clock-rate, small-core-count setup with HT turned off. So what if your latency jumps to 10-20msec every once in a while? You'll have great best-case and median numbers to show off, and you said you don't care about high percentiles.

However, if you care at all about your higher percentiles, you'll want to make sure you have enough cores to accommodate the peak number of runnable threads in your system, with some room to spare, and even then you may want to play some taskset, numactl, or isolcpus games. This is where higher core counts, dual-socket setups, and turning HT on come into play: they all help keep the high percentiles and outliers down (in both frequency and magnitude).

As for NICs: my experience is that burning a core or two on OpenOnload or the like (spinning with no backoff) is worth a couple of usec in the common case, and that this is especially true for non-saturated workloads (like your <1K/sec), where each latency is measured from a standstill (responding to a single event). If what you care about is latency (and not power, space, or saving a few $), burning those cores is a cheap way to get what you want. The same goes for the Disruptor: burning cores on controlled & dedicated spinning buys you better latency across the board, as long as you have plenty of cores to burn (it gets really bad when you don't).

When it comes to comparing clock frequencies, I usually tell people to compare the max turbo boost frequencies, rather than the nominal ones. E.g. an E5-2697 V3 has 14 cores, 35MB of L3, and shows 2.6GHz nominal. An E5-2687W V3 has 10 cores, 25MB of L3, and shows 3.1Ghz nominal. And an E5-2667 V3 has 8 cores, 20MB of L3, and shows 3.2Ghz nominal. So you may think the 3.2GHz number is best. But if you look carefully, all three chips have the same Turbo Boost frequency (3.6Ghz), and chances are that with the same number of cores actually busy, they'll all be running at the same frequency. But the larger chip (E5-2697) has more cores, which can help with outliers and allow for more generous "burning" of cores. And it has a significantly bigger L3 which can help with everything.

When you are latency sensitive, your best hint as an initial filter is the power envelope: you should be looking at and choosing between chips in the higher power envelope rankings (135-160W usually). The lower-power chips will keep your frequencies lower. [Also note that desktop chips tend to be classified by their turbo boost, as in "up to 4.4GHz", while server chips focus on nominal frequency (2.6GHz rather than "up to 3.6GHz"); that's probably marketing-driven.]

For the common single-socket vs. dual-socket question, and Haswell-E vs. E5, I usually lean towards the E5s. Yes, something like the i7-4790K has a higher clock rate: 4.0GHz nominal, boosting to 4.4GHz. But it only has 4 physical cores, and with the Disruptor and a good NIC stack you'll be burning those cores down fast, and may quickly have 10-20msec outliers to deal with, causing you to prematurely give up on spinning and start paying in common-case latency for this lack of spinning cores. Another E5 benefit is memory capacity and bandwidth: i7s tend to peak at 32GB and have only 2 memory channels, while E5s have 4 channels, support faster DRAMs (e.g. 2133 vs. 1600), and a lot more memory (a config of 128GB per socket is commonplace in 2-socket servers today; e.g. that's what the latest EC2 machines seem to be built with).

As for 1 vs. 2 sockets: most 1-vs-2-socket situations where the 1-socket system "wins" can be matched by using numactl and locking the workload (including its RAM) down to one of the two sockets. 2-socket setups have the benefit of allowing you to separate background stuff (like the OS, rarely active processes or threads, and potentially your own non-critical threads) from your critical workload. They do require more care, e.g. with things like IRQ assignment and making sure your packets are processed (in the network stack) in the same L3 that your consuming threads will be on, but in the end that extra capacity can be a lifesaver.

Bottom line:

When people ask me about a "good" system for low latency (and this happens a lot), I answer that if $$ or power are not the drivers, and latency is, my current recommendation is to go with a 2 socket E5 (with something like an E5-2697 V3 rightness), and 128GB-256GB of memory. The street price for that is somewhere between $8K and $12K ($3K-$4K of which is spent when focusing on the faster, higher core count Xeons). I also recommend investing in a good 10G NIC in (like Solarflare) which will add some to that price, but buy you a few more usec.


Vitaly Davidovich

unread,
Oct 30, 2014, 9:54:13 AM10/30/14
to mechanica...@googlegroups.com

This is consistent with my experience as well, both in Java and .NET (not necessarily the same allocation rate Kirk observes, but the general notion).

There's a lot of work involved in stopping mutators, doing a GC, and restarting them (safepoints, stack walking, OS kernel calls, icache and dcache pollution, actual GC logic, etc.). Even if collections are done concurrently, there are resources being taken away from app threads. Sure, if you have a giant machine where the GC can be fully segregated from app threads, maybe that's OK. But even then, I'd rather use more of that machine for app logic than for overhead, so keeping allocations to a minimum is beneficial.

In native languages, people tend to care about memory allocation more (some of that comes naturally from the fact that, well, it's an unmanaged environment, but it's also due to being conscious of its performance implications), whereas in Java and .NET it's a free-for-all; I blame that on the "allocations are cheap" mantra. Yes, an individual allocation is cheap, but not when you start pounding the system with them. Native apps tend to