Choosing hardware to minimize latency


Michael Mattoss

Oct 23, 2014, 4:35:42 PM
to mechanica...@googlegroups.com
Hi all,

I wrote an application based around the Disruptor that receives market events and responds accordingly.
I'm trying to decide which hardware configuration to purchase for the server hosting the app, in order to minimize the latency between receiving an event and sending a response.
I should mention that the app receives < 1K messages/second, so throughput is not much of an issue.

I tried to do some research, but the amount of data/conflicting info is overwhelming, so I was hoping some of the experts on this group could offer their insights.

How should I choose the right CPU type? Should I go with a Xeon E5/E7 for the large cache, or should I favor a high clock speed CPU like the i7 4790K (4.4GHz), since 99% of the work is done in a single thread?
What about the new Haswell-E CPUs, which seem to strike a good balance between cache size & core speed and also utilize DDR4 memory?
Does it matter whether a 16GB memory configuration, for example, is 4x4GB or 2x8GB?
Should I use an SSD or a high performance (15K RPM) mechanical drive? (The app runs entirely in memory of course, and the business-logic thread is not I/O bound, but there's a substantial amount of data written sequentially to log files.) How about a combination of the two (SSD for the OS and a mechanical drive for the log files)?
Is it worth investing in a high performance NIC such as those offered by Solarflare if OpenOnload (kernel bypass) is not used (just for the benefit of CPU offloading)?

Any help, suggestions and tips you may offer would be greatly appreciated.

Thank you!
Michael

Dan Eloff

Oct 23, 2014, 5:53:20 PM
to mechanica...@googlegroups.com
A couple of things:

Solarflare NICs are really good; CloudFlare tested a bunch of NICs and found them to be without rival: http://blog.cloudflare.com/a-tour-inside-cloudflares-latest-generation-servers/

Haswell-E plus DDR4 is probably the way I'd go: the power savings should eventually pay for themselves, and the higher memory speeds will give a boost to most workloads. Given your single-threaded workload, pick something with a high clock speed. But keep in mind that with the Disruptor it's usually easy to pipeline the workload and utilize more cores, e.g. receive, deserialize, preprocess, and postprocess requests on other threads.

SSDs are great, especially if you want to fsync data to disk for reliability. However, know your write workload, and keep write amplification in mind (small writes that are fsynced will write 128KB, or whatever the erase block size is). In typical workloads SSDs are just so much more reliable, and you really don't want to be dealing with a crashed hard drive at 4am if you can avoid it. The price/GB is good enough for most things these days.
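
If it helps, here's a rough sketch of the batching pattern I mean, assuming a single logging thread (the file name, buffer size, and class name are illustrative):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import static java.nio.file.StandardOpenOption.*;

    final class SequentialLog {
        // One erase block's worth of buffering before hitting the device.
        private final ByteBuffer buffer = ByteBuffer.allocateDirect(128 * 1024);
        private final FileChannel channel;

        SequentialLog(Path path) throws IOException {
            channel = FileChannel.open(path, CREATE, WRITE, APPEND);
        }

        void append(ByteBuffer record) throws IOException {
            if (buffer.remaining() < record.remaining()) {
                flush();
            }
            buffer.put(record);
        }

        void flush() throws IOException {
            buffer.flip();
            while (buffer.hasRemaining()) {
                channel.write(buffer);
            }
            buffer.clear();
            channel.force(false); // one fsync per batch, not per record
        }
    }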

Cheers,
Dan






Gil Tene

Oct 24, 2014, 1:57:04 AM
to mechanica...@googlegroups.com
[Some of this is cut/pasted from emails in which I had answered similar questions]

A few clarification points before answering the questions themselves:

You mentioned that throughput is not the main concern (<1K messages/sec expected), and that you want to minimize the latency between receiving an event and sending a response. But when you say "minimize the latency", do you mean:
a) you want to minimize the best case latency
b) you want to minimize the median latency
c) you want to minimize the worst experienced latency
d) some other level (e.g. minimize the 99.9%'ile)

The answers to some of your questions will vary depending on your answers. And the right answer is probably to have goals for at least three of the points above (b, c, and d at least).

If all you care about is the best case and the median latency, you are probably best off with a single socket, highest clock rate, small core count setup with HT turned off. So what if your latency jumps to 10-20msec every once in a while? You'll have great best case and median numbers to show off, and you said you don't care about high percentiles.

However, if you care at all about your higher percentiles, you'll want to make sure you have enough cores to accommodate the peak number of runnable threads in your system, with some room to spare, and even then you may want to play some taskset, numactl, or isolcpus games. This is where higher core counts, dual socket setups, and turning HT on come into play: they all help keep the high percentiles and outliers down (in both frequency and magnitude).
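
To make that concrete, the "games" look roughly like this on Linux (core numbers are illustrative; check your own topology with lscpu):

    # Kernel boot parameter: keep the scheduler off cores 2-7
    isolcpus=2-7

    # Pin the app to the isolated cores
    taskset -c 2-7 java -jar yourapp.jar

    # Or bind both CPU and memory to socket 0 on a NUMA box
    numactl --cpunodebind=0 --membind=0 java -jar yourapp.jar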

As for NICs: my experience is that burning a core or two on OpenOnload or the like (spinning with no backoff) is worth a couple of usec in the common case, and that this is especially true for non-saturated workloads (like your <1K/sec), where each latency is measured from a standstill (responding to a single event). If what you care about is latency (and not power, space, or saving a few $), burning those cores is a cheap way to get what you want. Same goes for the Disruptor: burning cores on controlled & dedicated spinning buys you better latency across the board, as long as you have plenty of cores to burn (it gets really bad when you don't have those cores).
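
With the Disruptor, that trade is made through the wait strategy. A sketch using the Disruptor 3.x DSL (MarketEvent is a stand-in for your own event class):

    import com.lmax.disruptor.BusySpinWaitStrategy;
    import com.lmax.disruptor.dsl.Disruptor;
    import com.lmax.disruptor.dsl.ProducerType;
    import java.util.concurrent.Executor;
    import java.util.concurrent.Executors;

    Executor executor = Executors.newCachedThreadPool();
    Disruptor<MarketEvent> disruptor = new Disruptor<>(
        MarketEvent::new,           // entries preallocated up front
        1 << 14,                    // ring size (illustrative)
        executor,
        ProducerType.SINGLE,        // a single event-publishing thread
        new BusySpinWaitStrategy()  // burn a dedicated core for the lowest latency
    );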

When it comes to comparing clock frequencies, I usually tell people to compare the max turbo boost frequencies, rather than the nominal ones. E.g. an E5-2697 V3 has 14 cores, 35MB of L3, and shows 2.6GHz nominal. An E5-2687W V3 has 10 cores, 25MB of L3, and shows 3.1GHz nominal. And an E5-2667 V3 has 8 cores, 20MB of L3, and shows 3.2GHz nominal. So you may think the 3.2GHz number is best. But if you look carefully, all three chips have the same Turbo Boost frequency (3.6GHz), and chances are that with the same number of cores actually busy, they'll all be running at the same frequency. But the larger chip (E5-2697) has more cores, which can help with outliers and allow for more generous "burning" of cores. And it has a significantly bigger L3, which can help with everything.

When you are latency sensitive, your best hint as an initial filter is the power envelope: you should be looking at and choosing between chips in the higher power envelope rankings (135-160W usually). The lower power chips will keep your frequencies lower. [Also note that desktop chips tend to be classified by their turbo boost, as in "up to 4.4GHz", while server chips focus on the nominal frequency (2.6GHz rather than "up to 3.6GHz"); that's probably marketing driven.]

For the common single socket vs. dual socket question, and Haswell-E vs. E5, I usually lean towards the E5s. Yes, something like the i7-4790K has a higher clock rate: 4.0GHz nominal, boosting to 4.4GHz. But it only has 4 physical cores, and with the Disruptor and a good NIC stack you'll burn through those cores fast, and may have 10-20msec outliers to deal with, causing you to prematurely give up on spinning and start paying in common case latency for this lack of spinning cores. Another E5 benefit is memory capacity and bandwidth: i7s tend to peak at 32GB and have only 2 memory channels, while E5s have 4 channels, support faster DRAM (e.g. 2133 vs. 1600), and a lot more memory (a config of 128GB per socket is commonplace in 2 socket servers today; e.g. that's what the latest EC2 machines seem to be built with).

As for 1 vs. 2 sockets: most 1 vs. 2 socket situations where the 1 socket system "wins" can be matched by using numactl to lock the workload (including its RAM) down to one of the two sockets. 2 socket setups have the benefit of allowing you to separate background stuff (like the OS, rarely active processes or threads, and potentially your own non-critical threads) from your critical workload. They do require more care, e.g. with things like IRQ assignment and making sure your packets are processed (in the network stack) in the same L3 that your consuming threads will be on, but in the end that extra capacity can be a lifesaver.

Bottom line:

When people ask me about a "good" system for low latency (and this happens a lot), I answer that if $$ or power are not the drivers, and latency is, my current recommendation is to go with a 2 socket E5 (with something like an E5-2697 V3) and 128GB-256GB of memory. The street price for that is somewhere between $8K and $12K ($3K-$4K of which is spent when focusing on the faster, higher core count Xeons). I also recommend investing in a good 10G NIC (like Solarflare), which will add some to that price but buy you a few more usec.

Martin Thompson

Oct 24, 2014, 2:53:35 AM
to mechanica...@googlegroups.com
On 24 October 2014 06:57, Gil Tene <g...@azulsystems.com> wrote:
[Gil's message quoted in full; snipped]

+1 A lot of good advice here.

I'd add to the point of getting a dual socket E5: you want to avoid L3 cache pollution from the OS and other processes on your machine. You can do this by isocups'ing the second socket and running your app there. The Intel L3 cache is inclusive of L1 and L2, so another core can cause your data/code to be evicted from the L3, and thus from your private L1/L2, even when threads are pinned to cores. Classic cold start problem after a quiet period.

You need to strike a balance between having sufficient cores to *always* be able to run a thread when needed, and the fact that larger L3 caches tend to have higher latencies; this also needs to consider your working set size in L3. Be careful: a microbenchmark might suggest a smaller L3 is better because the microbenchmark's working set is small, but that can change radically in a real system requiring a larger L3 working set.

I can vouch for the Solarflare network cards too, but also consider Mellanox. In some benchmarks I've seen them be a touch faster again.

If money is no object then consider immersed liquid cooling and overclocking with turbo always turned up.

Jean-Philippe BEMPEL

Oct 24, 2014, 3:52:25 AM
to mechanica...@googlegroups.com
+1 to Gil. We are also in the low latency space with low throughput, and this is exactly what we are doing.


On Friday, October 24, 2014 7:57:04 AM UTC+2, Gil Tene wrote:
[Gil's message quoted in full; snipped]

Michael Mattoss

Oct 24, 2014, 10:44:53 AM
to mechanica...@googlegroups.com
Thank you all for your replies and insights!
I would also like to take this opportunity to thank Martin and Gil for developing, and more importantly sharing, the Disruptor and HdrHistogram libraries!

Regarding the latency requirements, I'm mostly concerned with minimizing the 99.9%'ile latency.
Unfortunately, due to budget constraints (somewhere around $5K), a dual socket server doesn't look like a viable option. I'm looking into Xeon E5 v3 and Haswell-E processors and have narrowed down the list of possible candidates, but I need to do more research.

I have a few more questions:
1. What are the pros & cons of disabling HT (other than the obvious reduction of logical cores)?
2. Does it make sense to enable HT to increase the total number of available cores, but to isolate some of the physical cores and assign them only 1 thread, so that the thread does not share the physical core with any other threads?
3. Is there a rule of thumb on how to balance cache size against core speed?

Thank you,
Michael

Gil Tene

Oct 24, 2014, 11:31:38 AM
to mechanica...@googlegroups.com
Budgeting tip: 

Always ask yourself this question: Is saving $5K to $10K in hardware budget worth 1 month of time to market and 1 month of engineering time?

- If the answer is yes, your application isn't worth much. I.e. [by definition] its value is less than $5K to $10K per month, minus the cost of 1 month of engineering. And your engineers aren't being paid very well.

- If the answer is "That extra $5K in hardware won't buy me that that engineering time and time-to-market" (which is very reasonable for many apps) then what that answer really means your application's probably doesn't care about the 10-20% difference in the lower latency paths that the hardware will cheaply get you, and you can live fine in a $5K two socket server (e.g. E5-2640 V3) with 128GB of memory (that's the true commodity point for servers that are not extremely speed or latency sensitive these days).

If you are going to spend the time to get the low latency parts right, the engineering-hours budget involved in making latency behave well will far outrun any amount of $$ you spend on better commodity hardware (that's true unless you intend to deploy several 100s of these machines, and even then I'd bet the engineering effort would cost more than the hardware). You should expect to spend more on engineering than on hardware regardless. But when dealing with low latency, spending an extra $5K-$10K on your hardware to save yourself the time, cost, pain, and duct-tape involved in engineering your workload to fit into a low budget box ALWAYS makes sense.

Jean-Philippe BEMPEL

Oct 24, 2014, 4:37:23 PM
to mechanica...@googlegroups.com
I do not know what your numbers for low latency are, or how sensitive you are to them, but for us, at around 100us, it is _vital_ to have 2 sockets: with only one socket our latencies literally _double_ (200us) because of the L3 cache pollution due to administrative threads and other Linux processes...
99.9% is also a pretty aggressive percentile, especially if you have low throughput.

just my 2 cents.


Martin Thompson

Oct 24, 2014, 5:51:56 PM
to mechanica...@googlegroups.com
+1


Gil Tene

Oct 24, 2014, 7:18:00 PM
to mechanica...@googlegroups.com
I couldn't resist. On my browser, Google groups shows this for Martin's last message:
 

+1

To unsubscribe from this group and all its topics, send an email to mechanical-sympathy+unsub...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.


Steve Morin

Oct 25, 2014, 11:40:53 AM
to mechanica...@googlegroups.com
What's isocups? Saw your reference to it but google didn't come up with much.

Georges Gomes

Oct 25, 2014, 11:50:34 AM
to mechanica...@googlegroups.com
Isolcpus
Isolation of CPUs
Google should have a better answer for that :)

On Sat, Oct 25, 2014, 17:40 Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.

Martin Thompson

Oct 25, 2014, 11:50:50 AM
to mechanica...@googlegroups.com

On 25 October 2014 16:40, Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.

Karlis Zigurs

Oct 25, 2014, 11:54:49 AM
to mechanica...@googlegroups.com
Don't forget to use taskset (or numactl) to launch the process on the 'freed' cores/socket afterwards as well.
Also it may be useful to tune garbage collector threads so that they match the dedicated core count (any real world experiences around this would be interesting to hear).

K


Michael Mattoss

Oct 25, 2014, 2:24:29 PM
to mechanica...@googlegroups.com
Thanks Gil, you raise some good points. I will have to reconsider the hardware budget.
Could you please answer the questions in my previous post regarding enabling/disabling HT, and how one weighs cache size against core speed?

Thank you,
Michael

Michael Mattoss

Oct 25, 2014, 2:39:28 PM
to mechanica...@googlegroups.com, jean-p...@bempel.fr
These numbers seem a bit high. At what percentile do you get to 100us latency?
Regarding the 99.9%'ile, I actually think quite the opposite: the fewer messages you have (as in a low throughput environment), the greater the impact of responding late to a message.

Michael Mattoss

Oct 25, 2014, 2:43:41 PM
to mechanica...@googlegroups.com
Out of curiosity, is there an equivalent to isolcpus in the Windows world?


On Saturday, October 25, 2014 6:50:50 PM UTC+3, Martin Thompson wrote:
On 25 October 2014 16:40, Steve Morin <steve...@gmail.com> wrote:
What's isocups? Saw your reference to it but google didn't come up with much.


Jean-Philippe BEMPEL

Oct 25, 2014, 3:54:29 PM
to mechanica...@googlegroups.com
Michael, 

For HT, the thing is: if the 2 threads share data in L1 and L2 you are fine; otherwise the threads pollute each other, pulling lines from L3.


Jean-Philippe BEMPEL

Oct 25, 2014, 4:08:03 PM
to mechanica...@googlegroups.com
Consider that we get 100us at the 50th percentile.

The thing is, at a low rate the caches are not very warm. This is why thread affinity and isolcpus are mandatory.
If you increase throughput, you generally observe better latencies.

For percentiles it is a question of math: it depends on the number of measurements. For us, GC impacts the 99.9th percentile.

At 99%, since we have a low allocation rate, minor GCs do not impact our measurements (this is also because they include coordinated omission).



Michael Mattoss

Oct 25, 2014, 6:03:45 PM
to mechanica...@googlegroups.com, jean-p...@bempel.fr
Perhaps I'm missing something here, but if you warm up the cache before the system switches to the steady-state phase, and you use thread affinity and isolcpus, why would the cache get cold, even when the message rate is low? Does the CPU evict cache lines based on time?
As for GCs, I'm happy to say that I don't need to worry about them. I put a lot of work into designing & implementing my system so that it would not allocate any memory during the steady-state phase. Instead, it allocates all the memory it will need upfront (during the initialization phase) and afterwards just recycles objects through pools.
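
Roughly, each pool looks like this (a sketch; Order is a stand-in for my actual event classes, which all know how to reset themselves):

    import java.util.ArrayDeque;

    // Preallocated, single-threaded pool: all allocation happens at init time.
    final class OrderPool {
        private final ArrayDeque<Order> free = new ArrayDeque<>();

        OrderPool(int size) {
            for (int i = 0; i < size; i++) {
                free.push(new Order());  // initialization-phase allocation only
            }
        }

        Order acquire() {
            Order o = free.poll();
            if (o == null) {
                throw new IllegalStateException("pool exhausted"); // size for the worst case
            }
            return o;
        }

        void release(Order o) {
            o.reset();                   // clear fields so stale state can't leak
            free.push(o);
        }
    }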

Martin Thompson

Oct 25, 2014, 7:12:34 PM
to mechanica...@googlegroups.com
On 25 October 2014 23:03, Michael Mattoss <michael...@gmail.com> wrote:
Perhaps I'm missing something here, but if you warm up the cache before the system switches to the steady-state phase, and you use thread affinity and isolcpus, why would the cache get cold, even when the message rate is low? Does the CPU evict cache lines based on time?

Intel runs an inclusive cache policy whereby the L3 cache contains all the lines in the private L1s and L2s for the same socket. Therefore if anything runs on that socket, the L3 cache will be impacted. Isolcpus and pinning do help but are not perfect: ssh will still run on all cores to gather entropy, even when cores have been isolated. You also need to be aware that higher power saving states can evict various buffers and caches, e.g. the branch prediction state, L0 instruction cache, and decode buffers, just for the core alone, never mind the L1 and L2 caches.

If you are running a JVM then there are a number of things to watch out for, like RMI doing a system GC to support distributed GC, whether you need it or not!
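
If RMI's distributed GC does bite, the usual knobs are stretching its GC interval, or disabling explicit GC outright (interval values are in milliseconds and illustrative):

    # Stretch RMI's periodic full GCs out to once an hour
    java -Dsun.rmi.dgc.client.gcInterval=3600000 \
         -Dsun.rmi.dgc.server.gcInterval=3600000 \
         -jar yourapp.jar

    # Or turn System.gc() calls into no-ops entirely
    java -XX:+DisableExplicitGC -jar yourapp.jar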
 

Gil Tene

Oct 27, 2014, 1:01:17 AM
to mechanica...@googlegroups.com
1. What are the pros & cons of disabling HT (other than the obvious reduction of logical cores)?

The con is the loss of logical cores. That one is huge (extremely non-linear), which is why I always recommend people start with HT on, only try turning it off late in the game, and very carefully compare the before/after situation with regard to outliers (i.e. compare multi-hour runs, not those silly 5 minute u-bench things). A "rare" 20msec spike in runnable threads that happens only once per hour can do a lot of damage, and HT on/off *could* make the difference between that spike killing you with outliers and not.

The pro varies a lot. A non-HT core has various "betterness" levels that depend on the core technology (e.g. Haswell vs. Westmere) and on your workload. For resources that are "dynamically partitioned" between the HTs on a core, an idle HT has literally no effect. But there are resources that are statically partitioned when HT is on, and halving those can result in various slowdown effects. E.g. halving the first level TLB (as was the case on some earlier cores) will affect you, but the effect will depend on (4K and 2M page level) locality of access. Branch prediction resources, reservation slots, etc. are also statically partitioned in some cores, and similarly could have an effect that varies from 0 to a lot...

Best thing to do is measure. Measure the effect on *your* common case latency, on *your* outliers, and on *your* throughput. [And report back on what you find]  
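
A sketch of what that measurement can look like with HdrHistogram (mentioned up-thread); processOneEvent() is a stand-in for your receive-to-response path:

    import org.HdrHistogram.Histogram;

    // Track values from 1 ns up to 1 hour, at 3 significant decimal digits.
    Histogram histogram = new Histogram(3_600_000_000_000L, 3);

    for (int i = 0; i < 1_000_000; i++) {
        long start = System.nanoTime();
        processOneEvent();               // stand-in for the real work
        histogram.recordValue(System.nanoTime() - start);
    }

    // Print the full percentile distribution, scaled from ns to usec.
    histogram.outputPercentileDistribution(System.out, 1000.0);

    // Note: a free-running loop like this measures service time only;
    // measure against an intended arrival schedule to avoid coordinated omission.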
 
2. Does it make sense to enable HT to increase the total number of available cores, but to isolate some of the physical cores and assign them only 1 thread, so that the thread does not share the physical core with any other threads?

You can certainly do that, but only if you were already planning to use isolcpus. When you use isolcpus and static thread-to-CPU assignment, keeping neighbors away from specific threads by "burning" 2 HTs for the one thread can be useful. But in a low load system (like yours), the scheduler will tend to keep things spread across the physical cores anyway (again, the benefit of having chips with plenty of cores).
 
3. Is there a rule of thumb on how to balance cache size against core speed?

My rule of thumb is that cache is always better than core speed if you have non-I/O-related cache misses. This obviously reverses at some point (at an L3 miss rate of 0.00001% I may go with core speed).

The harder tradeoff is cache size vs. L3 latency. Because the L3 sits on a ring in modern Xeons, the larger the L3 is, the more hops there are in the ring, and the higher the L3 latency can get to some (random) addresses. It gets more complicated, too: e.g. on the newer Haswell chips, going above a certain number of cores may step you into a longer-latency huge L3 ring, or may force you to partition the chip in two (with 2 shorter latency rings with limited bandwidth and higher latency when crossing rings).

But honestly, detail levels like the L3 latency variance between chip sizes are so *way* down in the noise at the point where most people start studying their latency behavior. You may get to actually caring about it eventually, but you probably have 20 bigger fish to fry first.

Martin Thompson

Oct 27, 2014, 4:06:43 AM
to mechanica...@googlegroups.com

On 27 October 2014 05:01, Gil Tene <g...@azulsystems.com> wrote:

3. Is there a rule of thumb on how to balance cache size against core speed?

My rule of thumb is that cache is always better than core speed if you have non-I/O-related cache misses. This obviously reverses at some point (at an L3 miss rate of 0.00001% I may go with core speed).

The harder tradeoff is cache size vs. L3 latency. Because the L3 sits on a ring in modern Xeons, the larger the L3 is, the more hops there are in the ring, and the higher the L3 latency can get to some (random) addresses. It gets more complicated, too: e.g. on the newer Haswell chips, going above a certain number of cores may step you into a longer-latency huge L3 ring, or may force you to partition the chip in two (with 2 shorter latency rings with limited bandwidth and higher latency when crossing rings).

But honestly, detail levels like the L3 latency variance between chip sizes are so *way* down in the noise at the point where most people start studying their latency behavior. You may get to actually caring about it eventually, but you probably have 20 bigger fish to fry first.

I've been spending a good chunk of this year building a low-latency system, and lately I've been tracking the major causes of latency during the profiling phase. The top three causes of latency if you are working in the 5-50us range are:

1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8: some that you think should not allocate do allocate (see the sketch after this list). And don't get me started on APIs like NIO!!!

2. Resource Contention: You always need to have more cores available than the number of threads that need to run. HT helps somewhat, but it is really a poor person's alternative to having sufficient real cores. You often need more cores than you think.

3. Wait-Free vs Lock-Free algorithms: When a controlling thread takes an interrupt, the other threads involved in the same algorithm cannot make progress. Reworking core algorithms to be wait-free, in addition to lock-free, made a much bigger improvement to the long tail than I expected, even without having applied isolcpus or thread pinning. We are explicitly designing our system so that it behaves as well as possible on a vanilla distribution, as well as when pinning and other tricks are employed.

These sorts of things matter more than the size of your L3 cache. The difference in latencies between different L3 cache sizes can easily be traded off by just slightly reducing the amount of pointer chasing you do in your code.
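
To illustrate the lambda point in item 1 (a sketch with a hypothetical handler; always confirm with an allocation profiler, since escape analysis sometimes saves you):

    import java.util.function.LongConsumer;

    final class QuoteHandler {
        private long lastPrice;

        // Non-capturing lambda: linked once and reused, no per-call allocation.
        static final LongConsumer LOG = price -> System.out.println(price);

        void onQuote(long price) {
            // Capturing lambda: it captures 'this', so a new instance is
            // allocated on every call unless escape analysis elides it.
            dispatch(p -> this.lastPrice = p, price);
        }

        private static void dispatch(LongConsumer consumer, long price) {
            consumer.accept(price);
        }
    }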

Michael Mattoss

Oct 27, 2014, 5:47:43 PM
to mechanica...@googlegroups.com
I would like to thank everyone for sharing their knowledge and insights. Much appreciated!
Hopefully, I'll be able to share some data in a few weeks.

Thanks again,
Michael

Tom Lee

Oct 28, 2014, 12:52:54 AM
to mechanica...@googlegroups.com
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8: some that you think should not allocate do allocate. And don't get me started on APIs like NIO!!!

Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations (at least on the JVM; I can't speak to native code, but I imagine malloc/free could hurt just as much as GC there). We considered pooling briefly in the past, but our working set is often large enough that it basically becomes a choice between potentially exhausting pools or cluttering old gen. Even without pooling, careful reuse of objects between iterations can be fraught with risk, as any other allocations on the hot path can quietly shuttle any number of objects over to old gen. It's tough.

Strings are also a particularly painful topic as far as the JVM is concerned: you're often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

Cheers,
Tom





--
Tom Lee http://tomlee.co / @tglee

Richard Warburton

Oct 28, 2014, 4:50:36 AM
to mechanica...@googlegroups.com
Hi,

Strings are also a particularly painful topic as far as the JVM is concerned: you're often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

I've proposed a patch to String on core-libs which lets you more cheaply encode and decode them from/to ByteBuffer and subsequences of byte[]. It's not accepted yet, but the response seemed to be positive, so hopefully Java 9 will be a bit better in this regard. It would be nice to add similar functionality to StringBuffer/StringBuilder as well.

It would be nice to be able to implement a CharSequence that wraps some underlying source of bytes, but the problem with this kind of approach is that a lot of APIs take String over CharSequence. So it's not just that you end up reimplementing String: you also end up reimplementing a lot of other stuff.

The question always arises in my head: "Why are you using Strings?" If it's because you want a human-readable data format, then a binary encoding with a program that lets you pretty-print the encoding is just as good IMO, and avoids a lot of these issues. If you're interacting with an external protocol which is text based, I can appreciate that this isn't a decision you can make so easily.

Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

One of the specific on-heap NIO-related allocation hot-spots seems to be in Selectors, which have to update an internal HashSet inside select(). If you're processing a lot of selects then this can be problematic.

regards,

  Richard Warburton

Martin Thompson

Oct 28, 2014, 5:38:32 AM
to mechanica...@googlegroups.com
On 28 October 2014 04:52, Tom Lee <m...@tomlee.co> wrote:
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8: some that you think should not allocate do allocate. And don't get me started on APIs like NIO!!!

Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations (at least on the JVM; I can't speak to native code, but I imagine malloc/free could hurt just as much as GC there). We considered pooling briefly in the past, but our working set is often large enough that it basically becomes a choice between potentially exhausting pools or cluttering old gen. Even without pooling, careful reuse of objects between iterations can be fraught with risk, as any other allocations on the hot path can quietly shuttle any number of objects over to old gen. It's tough.

The summary is that if you are working in the tens-of-microseconds space then any allocation will result in significant outliers, regardless of JVM. Anyone in HFT is usually well aware of this issue.

However, the picture gets more interesting for general purpose applications. I get to profile quite a few applications in the field, across a number of domains. The thing this has taught me is that the biggest performance improvements often come from simply reducing the allocation rate. A little bit of allocation profiling and some minor code changes can give big returns. I recommend to all my customers that they run a profiler regularly and keep allocation to modest levels, regardless of application type, and the returns are significant for minimal effort. Our current garbage collectors seem to be not up to the job of coping with the new multicore world and large memory servers - Zing excluded :-)

Allocation is just as big an issue in the C/C++ world. In all worlds it is the reclamation, rather than the allocation, that is the issue. Just try allocating memory on one thread and freeing it on another in a native language and see how that performs. The big benefit we get in the native world is stack allocation. This has so many benefits besides cheap allocation and reclamation: it is also local in the hot OS page and does not suffer false sharing issues. In the multicore world I really miss stack allocation when using Java. It does not have to be a language feature; better Escape Analysis, or an alternative like the object explosion in JRockit, could help.

Pools can be a useful technique at times, and are especially a big win when handing things back and forth between two threads. However, pooling should be used with extreme caution and guided by measurement. Objects in pools need to be considered immortal.
 
Strings are also a particularly painful topic as far as the JVM is concerned: you're often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without a lot of spam.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?

Not as strange as you would think. Besides the proliferation of XML and JSON encodings, we have to deal with text even on some of the highest performance systems. Most financial exchanges still use the tag-value encoded FIX protocol. Thankfully many exchanges are now moving to binary encodings, but it will take time for the majority to migrate.

Codecs burn more CPU cycles than anything else in most applications. We need better examples for people to copy, and the right primitives for building parsers, especially in Java. We need simple things like being able to go between ByteBuffer and String without intermediate copies to and from byte[]s, e.g. a constructor something like String(ByteBuffer bb, Charset cs) and a String method int getBytes(ByteBuffer bb, Charset cs).

We also need to be able to deal with text-encoded numbers, dates, times, etc. in ByteBuffer and byte[] without allocation and copying: things like writeInt(int value) as ASCII or UTF-8, to and from ByteBuffer or byte[]. This way the bounds checking can be done once, fewer reference indirections are taken per operation, and allocation is avoided entirely.
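
Until then we hand-roll it. A sketch of an allocation-free ASCII int writer over ByteBuffer (illustrative, not a proposed JDK API; Integer.MIN_VALUE needs special-casing):

    import java.nio.ByteBuffer;

    final class AsciiCodec {
        // Writes 'value' as ASCII digits at the buffer's current position,
        // with no intermediate String or byte[] allocation.
        static void putIntAscii(ByteBuffer buffer, int value) {
            if (value < 0) {
                buffer.put((byte) '-');
                value = -value;          // overflows for Integer.MIN_VALUE
            }
            int digits = 1;
            for (int v = value; v >= 10; v /= 10) {
                digits++;
            }
            int end = buffer.position() + digits;
            for (int i = end - 1; i >= buffer.position(); i--) {
                buffer.put(i, (byte) ('0' + (value % 10)));
                value /= 10;
            }
            buffer.position(end);
        }
    }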

When I build parsers I often write my own String handling classes and design the APIs so they are composable. That is, have the lower-level APIs trade complexity for efficiency, and allow them to be wrapped/composed with APIs that are idiomatic or easier to use. This way the end user has a choice, rather than being treated like a child who cannot make choices. On a quick scan of the proposed JSON API for Java 9, none of the important lessons seem to have been learned.
 
Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.

I'd be here all day if you got me started on NIO :-(
 
Just take the simple example of Selector. You do a selectNow(), and then you get a selected-key set that you must iterate over. Why not just pass a callback to selectNow(), or pass in a collection to fill? This, and the likes of String.split(), are examples of brain dead API design that cause performance issues. Richard and I have worked on the likes of this; he has at least listened to my whines and is trying to do something about it.
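
For anyone who hasn't hit it, the pattern in question looks like this (a sketch; handleRead() is a stand-in, and the garbage comes from the key-set churn inside select() plus the Iterator here):

    import java.io.IOException;
    import java.nio.channels.SelectionKey;
    import java.nio.channels.Selector;
    import java.util.Iterator;

    void poll(Selector selector) throws IOException {
        if (selector.selectNow() == 0) {
            return;
        }
        // The selected-key set must be drained by hand; a callback-taking
        // selectNow() would avoid both the set mutation and the iterator.
        Iterator<SelectionKey> it = selector.selectedKeys().iterator();
        while (it.hasNext()) {
            SelectionKey key = it.next();
            it.remove();                 // mandatory, or the key is reported again
            if (key.isReadable()) {
                handleRead(key);         // stand-in for real processing
            }
        }
    }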

Vitaly Davidovich

Oct 28, 2014, 1:16:15 PM
to mechanica...@googlegroups.com

+1. It doesn't help that the mainstream Java community continues to endorse and promote the "allocations are cheap" myth.

Sent from my phone
