[Some of this is cut/pasted from emails in which I had answered similar questions.]

A few clarification points before I answer the questions themselves. You mentioned that throughput is not the main concern (<1K messages/sec expected), and that you want to minimize the latency between receiving an event and sending a response. But when you say "minimize the latency", do you mean:

a) you want to minimize the best-case latency
b) you want to minimize the median latency
c) you want to minimize the worst experienced latency
d) some other level (e.g. minimize the 99.9%'ile)

The answers to some of your questions will vary depending on your answers, and the right answer is probably to have goals for at least three of the points above (b, c, and d at least).

If all you care about is the best-case and median latency, you are probably best off with a single-socket, highest-clock-rate, small-core-count setup with HT turned off. So what if your latency jumps to 10-20msec every once in a while? You'll have great best-case and median numbers to show off, and you said you don't care about high percentiles.

However, if you care at all about your higher percentiles, you'll want to make sure you have enough cores to accommodate the peak number of runnable threads in your system, with some room to spare, and even then you may want to play some taskset, numactl, or isolcpus games. This is where higher core counts, dual-socket setups, and turning HT on come into play: they all help keep the high percentiles and outliers down (in both frequency and magnitude).
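To make those goals measurable, here is a minimal sketch, assuming the HdrHistogram library (not named above, but a common choice in this space), that records per-response latencies and reports the median, 99.9%'ile, and max, so each of (b), (c), and (d) gets its own tracked number:

```java
import org.HdrHistogram.Histogram;

public class LatencyGoals {
    // Track values up to 10 seconds, with 3 significant decimal digits of precision.
    private static final Histogram histogram = new Histogram(10_000_000_000L, 3);

    // Call once per request with the measured response time in nanoseconds.
    public static void record(long latencyNanos) {
        histogram.recordValue(latencyNanos);
    }

    public static void report() {
        System.out.printf("median   : %d ns%n", histogram.getValueAtPercentile(50.0));
        System.out.printf("99.9%%ile : %d ns%n", histogram.getValueAtPercentile(99.9));
        System.out.printf("max      : %d ns%n", histogram.getMaxValue());
    }
}
```

Reporting all three at once makes it harder to accidentally optimize the median while the tail quietly degrades.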
As for NICs: my experience is that burning a core or two on OpenOnload or the like (spinning with no back-off) is worth a couple of usec in the common case, and that this is especially true for non-saturated workloads (like your <1K/sec), where each latency is measured from a standstill (responding to a single event). If what you care about is latency (and not power, space, or saving a few $), burning those cores is a cheap way to get what you want. Same goes for the Disruptor. Burning cores on controlled & dedicated spinning buys you better latency across the board, as long as you have plenty of cores to burn (it gets really bad when you don't have those cores).

When it comes to comparing clock frequencies, I usually tell people to compare the max turbo boost frequencies, rather than the nominal ones. E.g. an E5-2697 V3 has 14 cores, 35MB of L3, and shows 2.6GHz nominal. An E5-2687W V3 has 10 cores, 25MB of L3, and shows 3.1GHz nominal. And an E5-2667 V3 has 8 cores, 20MB of L3, and shows 3.2GHz nominal. So you may think the 3.2GHz number is best. But if you look carefully, all three chips have the same turbo boost frequency (3.6GHz), and chances are that with the same number of cores actually busy, they'll all be running at the same frequency. But the larger chip (E5-2697) has more cores, which can help with outliers and allow for more generous "burning" of cores. And it has a significantly bigger L3, which can help with everything.
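To make "burning a core on controlled & dedicated spinning" concrete, here is a minimal, hypothetical sketch of a consumer that busy-spins on an in-memory queue instead of blocking; the Disruptor's BusySpinWaitStrategy applies the same idea to its ring buffer. The class and event names below are made up for illustration:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical event type standing in for whatever the application passes around.
final class Event {
    long payload;
}

final class SpinningConsumer implements Runnable {
    private final Queue<Event> queue = new ConcurrentLinkedQueue<>();
    private volatile boolean running = true;

    @Override
    public void run() {
        // Busy-spin: never park or sleep, so the core stays hot and the cost of
        // reacting to a new message is just cache-coherency traffic, not a wake-up.
        // This deliberately "burns" one core for as long as the consumer is up.
        while (running) {
            Event e = queue.poll();
            if (e != null) {
                handle(e);
            }
            // No Thread.sleep()/park() here by design; Thread.onSpinWait() (Java 9+)
            // can hint the CPU without giving up the core.
        }
    }

    private void handle(Event e) {
        // Application-specific processing would go here.
    }

    void publish(Event e) {
        queue.offer(e);
    }

    void shutdown() {
        running = false;
    }
}
```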
When you are latency sensitive, your best hint as an initial filter is the power envelope: you should be looking at and choosing between chips in the higher power envelope rankings (135-160W usually). The lower-power chips will keep your frequencies lower. [Also note that desktop chips tend to be classified by their turbo boost frequency, as in "up to 4.4GHz", while server chips focus on the nominal frequency (2.6GHz rather than "up to 3.6GHz"); that's probably marketing driven.]
For the common single-socket vs. dual-socket question, and Haswell E vs. E5, I usually lean towards the E5s. Yes, something like the i7-4790K has a higher clock rate: 4.0GHz nominal, boosting to 4.4GHz, but it only has 4 physical cores, and with the Disruptor and a good NIC stack you'll burn through those cores fast, and may quickly find yourself with 10-20msec outliers to deal with, causing you to prematurely give up on spinning and start paying in common-case latency for the lack of spinning cores. Another E5 benefit is memory capacity and bandwidth. i7s tend to peak at 32GB and have only 2 memory channels, while E5s have 4 channels, support faster DRAMs (e.g. 2133 vs. 1600), and a lot more memory (a typical config of 128GB per socket is commonplace in 2-socket servers today; e.g. that's what the latest EC2 machines seem to be built with).
As for 1 vs. 2 sockets: most 1 vs. 2 socket situations where the 1-socket system "wins" can be matched with numactl and locking the workload (including its RAM) down to one of the two sockets. 2-socket setups have the benefit of allowing you to separate background stuff (like the OS and rarely active processes or threads, and potentially your own non-critical threads) from your critical workload. They do require more care, e.g. with things like IRQ assignment and making sure your packets are processed (in the network stack) on the same L3 that your consuming threads will be on, but in the end that extra capacity can be a lifesaver.

Bottom line: when people ask me about a "good" system for low latency (and this happens a lot), I answer that if $$ or power are not the drivers, and latency is, my current recommendation is to go with a 2-socket E5 (with something like an E5-2697 V3 right now), and 128GB-256GB of memory. The street price for that is somewhere between $8K and $12K ($3K-$4K of which is spent when focusing on the faster, higher core count Xeons). I also recommend investing in a good 10G NIC (like Solarflare), which will add some to that price but buy you a few more usec.
+1
What's isocups? Saw your reference to it but google didn't come up with much.
Perhaps I'm missing something here, but if you warm up the cache before the system switches to the steady-state phase, and you use thread affinity and isolcpus, why would the cache get cold, even when the message rate is low? Does the CPU evict cache lines based on time?
1. What are pros & cons of disabling HT (other than the obvious reduction of logical cores)?
2. Does it make sense to enable HT to increase the total number of available cores, but to isolate some of the physical cores and assign them only 1 thread, so that the thread does not share its physical core with any other threads?
3. Is there a rule of thumb on how to balance cache size against core speed?
3. Is there a rule of thumb on how to balance cache size against core speed?

My rule of thumb is that cache is always better than core speed if you have non-I/O-related cache misses. This obviously reverses at some point (at an L3 miss rate of 0.00001% I may go with core speed).

The harder tradeoff is cache size vs. L3 latency. Because the L3 sits on a ring in modern Xeons, the larger the L3 is, the more hops there are in the ring, and the higher the L3 latency can get to some (random) addresses. It gets more complicated, too. E.g. on the newer Haswell chips, going above a certain number of cores may step you into a longer-latency huge L3 ring, or may force you to partition the chip in two (with 2 shorter-latency rings with limited bandwidth and higher latency when crossing rings).

But honestly, details like L3 latency variance between chip sizes are so *way* down in the noise at the point where most people start studying their latency behavior. You may get to actually caring about it eventually, but you probably have 20 bigger fish to fry first.
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!
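To illustrate the lambda point, here is a small hedged sketch: a non-capturing lambda is typically created once and reused by the JVM, while a lambda that captures a local variable generally allocates a fresh object on every call (unless escape analysis happens to eliminate it), so the second form is the one to watch for on a hot path. The class and method names below are made up for illustration:

```java
import java.util.function.LongPredicate;

public class LambdaAllocation {

    // Non-capturing lambda: references no locals, so the JVM can reuse a single
    // cached instance; effectively no per-call allocation.
    static boolean anyPositive(long[] values) {
        return matches(values, v -> v > 0);
    }

    // Capturing lambda: closes over 'threshold', so a new LongPredicate instance
    // is generally allocated each time this method runs -- easy to miss on a hot path.
    static boolean anyAbove(long[] values, long threshold) {
        return matches(values, v -> v > threshold);
    }

    static boolean matches(long[] values, LongPredicate predicate) {
        for (long v : values) {
            if (predicate.test(v)) {
                return true;
            }
        }
        return false;
    }
}
```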
1. Allocation: You simply cannot have any allocation in your main path. Be very careful of lambdas in Java 8. Some that you think should not allocate, do allocate. And don't get me started on APIs like NIO!!!

Love to hear more about what you have to say here, Martin. We've sort of discovered this the hard way. Anything on the hot path needs to be extremely conservative about allocations (at least on the JVM -- can't speak to native code, but I can imagine that malloc/free could hurt just as much as GC there). We considered pooling briefly in the past, but our working set is often large enough that it basically becomes a choice between potentially exhausting pools or cluttering old gen. Even without pooling, careful reuse of objects between iterations can be fraught with risk, as any other allocations on the hot path can quietly shuttle any number of objects over to old gen. It's tough.
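A minimal sketch of the "careful reuse of objects between iterations" approach mentioned above: a single mutable event instance is preallocated and refilled for each message, so the steady-state path allocates nothing. The names here are illustrative, not from any particular codebase:

```java
// Illustrative only: a preallocated, mutable message object that is refilled
// on every iteration instead of allocating a new one per message.
final class ReusableTrade {
    long instrumentId;
    long price;
    long quantity;

    void clear() {
        instrumentId = 0;
        price = 0;
        quantity = 0;
    }
}

final class HotPathProcessor {
    // Allocated once, up front; the steady-state loop never allocates.
    private final ReusableTrade trade = new ReusableTrade();

    void onMessage(java.nio.ByteBuffer buffer) {
        trade.clear();
        trade.instrumentId = buffer.getLong();
        trade.price = buffer.getLong();
        trade.quantity = buffer.getLong();
        process(trade);
        // Note: 'trade' must not escape or be retained past this call,
        // otherwise reuse silently corrupts whoever kept a reference.
    }

    private void process(ReusableTrade t) {
        // Application-specific handling would go here.
    }
}
```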
Strings are also a particularly painful topic as far as the JVM is concerned: you're often forced into lots of unwanted copies and allocations. It's almost like you need to rebuild string operations from scratch on top of plain old byte arrays to do anything without generating a lot of garbage.

Maybe a silly question, but are folks doing high performance work with the JVM writing their own string handling routines or something?
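As one hedged example of what "string operations on top of plain old byte arrays" can look like, here is a sketch that parses an ASCII integer field directly out of a byte[] without ever materializing a java.lang.String (the field layout and names are made up for illustration):

```java
public final class AsciiParsing {

    // Parse a non-negative ASCII decimal number from buffer[offset, offset+length)
    // without creating a String or any other intermediate object on the happy path.
    public static long parseAsciiLong(byte[] buffer, int offset, int length) {
        long value = 0;
        for (int i = offset; i < offset + length; i++) {
            byte b = buffer[i];
            if (b < '0' || b > '9') {
                throw new IllegalArgumentException("not a digit at index " + i);
            }
            value = value * 10 + (b - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        byte[] wireMessage = "PRICE=104250".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        // The price starts after "PRICE=" (6 bytes) and runs to the end of the message.
        long price = parseAsciiLong(wireMessage, 6, wireMessage.length - 6);
        System.out.println(price); // 104250
    }
}
```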
Interesting that you bring up NIO too. We've certainly seen some very strange behavior with NIO allocating huge amounts of memory off-heap when left unchecked, but it never occurred to me there might be allocation-related performance issues there.
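One common source of that kind of off-heap growth (an assumption about this case, not a diagnosis) is that writing heap ByteBuffers to a channel makes the JDK copy them into per-thread temporary direct buffers that it caches and grows behind your back. A hedged sketch of the usual workaround: allocate one direct buffer up front and reuse it for every write, so neither the hidden copy nor per-call buffer growth happens on the hot path.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

final class ReusedDirectBufferWriter {
    // One direct buffer, allocated once; sized for the largest message we expect to send.
    private final ByteBuffer writeBuffer = ByteBuffer.allocateDirect(64 * 1024);

    void send(SocketChannel channel, byte[] payload) throws IOException {
        writeBuffer.clear();
        writeBuffer.put(payload);
        writeBuffer.flip();
        while (writeBuffer.hasRemaining()) {
            channel.write(writeBuffer); // direct buffer: no hidden copy into a temp direct buffer
        }
    }
}
```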
+1. It doesn't help that the mainstream Java community continues to endorse and promote the "allocations are cheap" myth.