what is our hardware doing???


Kirk Pepperdine

Sep 3, 2013, 3:07:21 AM
to mechanica...@googlegroups.com
Hi,

Part of my quest over the last several years has been looking for key indicators that say: something is wrong and you could be getting better performance out of the hardware. I'm not really talking about traditional algorithmic performance (e.g. bubble sort vs. quick sort) but about an orthogonal question: how is our hardware coping with what we are asking it to do? Eventually this should push back into the algorithmic question, but as I like to say, this isn't really about the destination, it's the journey that matters.

In the past I've kept pushing on patterns of CPU utilization as expressed in some of the higher-level performance counters, but I'm now finding with newer hardware that reading these counters is getting trickier: it requires more detailed and complex decision trees to interpret them, or other techniques to recover information that wasn't needed in the past. Simply put, I can't keep the decision tree in my infinitesimally small and aging brain, so I've been thinking about a newer way to expose answers to the question: what is the hardware up to? In other words, if my performance needs to be better, what are the key indicators that are easy and cheap to collect? On the other side of the coin, we also need heuristics or known thresholds to help us understand when we should expect more from the hardware we have, and when we need to somehow back off or throw more hardware at the problem.

Part of the motivation was trying to understand why stop-the-world collection is so damaging to performance even when GC is efficient and only happening at a rate of, say, once every 200 to 500ms. If we can avoid getting bogged down in the trivial or obvious things, I believe there is a deeper problem: frequent GC actually prevents apps from winding the hardware up, and that by itself is a huge performance drain. The question is, what is the cause and how do we see it? In this case I believe that a combination of safepointing behaviour and how CPU utilization is reported effectively hides the disruption. I started wondering if instruction retirement rates would be a better measure of CPU utilization and, if so, whether frequent GCs would show up as reduced retirement rates. Of course the next question is: wouldn't retirement rates be a better indicator than the super-flawed CPU % utilization that we've been using forever? In this case, the safepoint page fault vs. the run time of the collector results in GC and app threads never really getting going before they're done. In other words, the costs of collection are poorly amortized across bursts of overly short GC runs.
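[Editorial illustration, not part of the original post.] To make the "poorly amortized" point concrete, here is a minimal sketch that samples the standard GarbageCollectorMXBeans once a second and reports pause frequency, average pause length, and the fraction of wall time spent collecting. It only sees what the collector beans report, not time-to-safepoint or the cost of restarting stalled threads, so treat the numbers as a rough lower bound on the disruption:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcAmortizationSampler {
        public static void main(String[] args) throws InterruptedException {
            long prevCount = 0, prevTimeMs = 0;
            while (true) {
                long count = 0, timeMs = 0;
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    count += gc.getCollectionCount();  // collections since JVM start
                    timeMs += gc.getCollectionTime();  // approx. accumulated collection time, ms
                }
                long dCount = count - prevCount;
                long dTimeMs = timeMs - prevTimeMs;
                if (dCount > 0) {
                    // Many short pauses per second => work is chopped into bursts that never
                    // let app (or GC) threads get going: the poor-amortization problem.
                    System.out.printf("%d pauses/s, avg %.1f ms, ~%.1f%% of wall time%n",
                            dCount, (double) dTimeMs / dCount, dTimeMs / 10.0);
                }
                prevCount = count;
                prevTimeMs = timeMs;
                Thread.sleep(1000);
            }
        }
    }

Run it as a thread inside the JVM under observation (or point it at the same beans over JMX) and compare the reported GC share of wall time against the drop you see in throughput; the gap is what this thread is about.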

I've only started (re)thinking this and I've yet to run any experiments... yet. This is more of an attempt to gather up more ideas.

-- Kirk

Gil Tene

Sep 3, 2013, 4:27:46 AM
to mechanica...@googlegroups.com
I think that watching issue or retirement rates and metrics like CPI and L1/L2/L3 misses can certainly teach us a lot about code behavior, code efficiency, and potential optimization opportunities. But let me step in to defend the CPU utilization % metric - I think it is one of the most useful and historically reliable metrics available in computers today, and I fear that its reliability may be coming under attack by modern "tools" like hypervisors and SMT CPU cores.

CPU utilization is a very important metric that is often misused. It tells us how busy we are, but more importantly, how un-busy we are. For CPU utilization to be meaningful, it needs to indicate the amount of idle time the system is experiencing. CPU utilization % is the most important and succinct metric a sysadmin or a capacity planner can look at: when properly measured, it shows how much unused capacity a system has, and how far from "full" it is. Most sane sysadmins will keep peak sustained utilization levels somewhere between 20-60% of the levels that they know are "demonstrably safe" for the application at hand, knowing that headroom is the #1 tool for stability under load. They also know that saturation is the mother of all evils when it comes to system availability and acceptable service levels, and use metrics like CPU utilization to stay as far away from saturation as their budgets will allow.

To the sane admin and the experienced capacity tester, the world is simple and very empirical: if a system was demonstrably able to happily support a workload of X at some far-from-saturated CPU % Y, then a sysadmin who is interested in job security can sleep well at night as long as the CPU load never goes above some level (say Y/3), and will call his boss asking for immediate budget when that line gets crossed. CPU % doesn't cover all the resources that need to be watched for how-far-from-empirical-saturation-am-I conditions (like network bit rates and disk I/O operation rates), but it does correctly summarize many others (CPU cycle use, coherency and memory bandwidth use, interrupts, etc.) into a single, easy-to-watch how-far-from-trouble-am-I point of view.
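[Editorial illustration.] As a trivial sketch of that rule of thumb, with the caveat that the Y/3 divisor is just the example figure given above and not a universal constant:

    /** Crude headroom check in the spirit of the "stay far from saturation" rule. */
    static boolean timeToAskForBudget(double measuredCpuPct, double demonstrablySafePct) {
        double alarmLine = demonstrablySafePct / 3.0;  // e.g. proven safe at 60% -> alarm at 20%
        return measuredCpuPct > alarmLine;             // crossed the line: call the boss
    }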

Unfortunately, most benchmarking is done at CPU utilization levels that are nowhere near the range sane administrators will ever allow their systems to operate in, and fortunately sane sysadmins have long ago learned to completely disregard most synthetic benchmarks when trying to draw "how much can my stuff really handle" lines for production. Beyond the other follies often found in synthetic benchmarking, it is the practice of at-saturation testing - which is what most such benchmarks spend the majority of their time measuring - that leads to many mis-measurements. This arises quite simply from the fact that at saturation things behave differently, in multiple directions, than they would under the normal operating parameters that administrators work so hard to keep their systems in. I.e. some things get much, much better at saturation, some get much, much worse, and only some things actually stay the same. Here are some key examples:

Some things that get unrealistically better:
- CPU cache efficiency and miss rates usually get better at or near saturation.
- I/O throughput carrying capacity usually gets much better at or near saturation.
- I/O efficiency improves and the amount of CPU spent on handling I/O drops dramatically at or near saturation.
- GC efficiency can often increase at or near saturation.

Some things that get unrealistically worse:
- CPU scheduling delays and the resulting externally measured latency behavior often get dramatically worse at or near saturation, often introducing entirely new behavior modes into latency measurements.
- I/O subsystem latency behavior (not those silly averages, but the real metrics that matter, like 99.9%'iles) usually gets dramatically worse at or near saturation as queuing effects kick in.
- Power management on modern CPUs can often slow things down dramatically at or near saturation levels.

Martin Thompson

Sep 3, 2013, 5:04:48 AM
to mechanica...@googlegroups.com
It is a very interesting topic.

%CPU is a measure I've been having issues with for a lot of Java applications; it's not such an issue for C apps. I think Kirk and I keep seeing issues with overuse of safepointing in the JVM that limits throughput and increases latency, yet shows up as low %CPU utilisation. I've had people argue with me that their lock-based algorithm is better because of its low CPU utilisation, yet when I dig into it with them we cannot get over 70% utilisation because of safepointing.

It would be nice if the JVM were integrated with Linux perf. That way we could track CPU counters on context switches so that IPC (Instructions Per Cycle) becomes a useful metric. I'd like to see this expanded to show how many cycles are wasted due to safepoints, and thus how much progress is being stalled.

IPC is only useful when directly attributed to a running thread. This would be possible at relatively low overhead if the counters were collected at mode or context switches and logged against particular threads. If we did this we could track how efficient an algorithm is while running, and the opportunities for progress it loses to safepointing, context switching, or mode switching on syscalls.
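[Editorial illustration.] Just to make that bookkeeping concrete, here is a hypothetical sketch of the accounting being proposed; nothing like this exists in the JDK today, and the onSwitchOut/onSafepointStall hooks stand in for whatever per-thread counter snapshots the kernel/JVM integration would have to deliver:

    /** Hypothetical per-thread IPC ledger, fed at context/mode-switch boundaries. */
    final class ThreadCounterLedger {
        long instructionsRetired;      // accumulated while this thread was on-CPU
        long cyclesOnCpu;              // accumulated while this thread was on-CPU
        long cyclesStalledAtSafepoint; // cycles the thread was runnable but held at a safepoint

        // Called by the (hypothetical) switch-out hook with deltas since switch-in.
        void onSwitchOut(long dInstructions, long dCycles) {
            instructionsRetired += dInstructions;
            cyclesOnCpu += dCycles;
        }

        // Called by the (hypothetical) safepoint bookkeeping while the thread is stopped.
        void onSafepointStall(long dCycles) {
            cyclesStalledAtSafepoint += dCycles;
        }

        double ipc() {
            return cyclesOnCpu == 0 ? 0.0 : (double) instructionsRetired / cyclesOnCpu;
        }

        /** Fraction of potential progress lost to safepoint stalls. */
        double lostOpportunity() {
            long total = cyclesOnCpu + cyclesStalledAtSafepoint;
            return total == 0 ? 0.0 : (double) cyclesStalledAtSafepoint / total;
        }
    }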

Martin...

Gil Tene

Sep 3, 2013, 1:48:32 PM
to mechanica...@googlegroups.com
On a 32 vcore machine (which is what most newer 2-socket commodity servers are these days) Amdahl wakes up early in the morning.

The inability to keep CPUs busy can be easily modeled with Amdahl's law, and in this case it [luckily] has nothing to do with CPU speed or with whether hyperthreads and vcores provide truly linear additional work-carrying capacity. They provide the real ability to keep virtual CPUs busy, which is all that matters for the CPU% indicators. Working Amdahl's law backwards for a 32 vcore system, not being able to go past 70% system utilization translates into roughly 1.38% of the work being serial (in the sense that the other vcores are sitting idle waiting for this work to be done). [1/(B + (1-B)*(1/32)) = 32 * 0.7 solves roughly to B = 0.0138.] This is a surprisingly low number to many people.
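[Editorial illustration.] A quick sketch of that back-calculation, just restating the bracketed formula: solving Amdahl's law, speedup = 1/(B + (1-B)/N), for the serial fraction B given an observed utilization ceiling U on N vcores gives B = (1/U - 1)/(N - 1):

    /** Serial fraction B implied by Amdahl's law when N vcores cap out at utilization U.
     *  Derived from 1/(B + (1-B)/N) = N*U, which rearranges to B = (1/U - 1)/(N - 1). */
    static double impliedSerialFraction(int vcores, double maxUtilization) {
        return (1.0 / maxUtilization - 1.0) / (vcores - 1);
    }

    // impliedSerialFraction(32, 0.70) ~= 0.0138 -> ~1.38% serial work caps a 32 vcore box at 70%
    // impliedSerialFraction(64, 0.70) ~= 0.0068 -> the tolerance roughly halves as core counts double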

So for safepointing work to be responsible for ~1.38% of the work and to keep your CPU% from getting past ~70%, the ratio between your intra-safepoint time and your experienced inter-safepoint intervals would only need to be ~1:72. E.g. if you take a safepoint once every second (for a newgen GC, for example), then if the total time you spend in the safepoint (including getting into it, doing the work under it, and getting out of it, the sum of which is mostly included in -XX:+PrintGCApplicationStoppedTime) exceeded ~13msec, you'd have this condition. When your safepoint rates are quicker than that, the intra-safepoint time needed to cause this limitation drops linearly. E.g. if you safepoint 4 times per second, intra-safepoint times longer than ~3.5msec would cause the same 70% cap on achievable CPU%. Similarly, increasing the number of cores (we already have cheap 64 and 80 vcore x86 boxes out there, and more are coming with Ivy Bridge) causes a similarly linear drop in the amount of intra-safepoint time that systems can tolerate before running into the I-can't-seem-to-keep-my-CPUs-busy problem, even in what might seem to be lock-free code.
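[Editorial illustration.] Continuing the sketch above, the tolerable stopped time per safepoint for a given utilization target falls straight out of the same numbers:

    /** Max tolerable stopped time per safepoint (ms) before N vcores cap out at targetUtil,
     *  assuming safepoints are the only serial work: budget = impliedSerialFraction / rate. */
    static double maxStoppedMsPerSafepoint(int vcores, double targetUtil, double safepointsPerSec) {
        double serialFraction = (1.0 / targetUtil - 1.0) / (vcores - 1);
        return serialFraction * 1000.0 / safepointsPerSec;
    }

    // maxStoppedMsPerSafepoint(32, 0.70, 1.0) ~= 13.8 ms  (one safepoint per second)
    // maxStoppedMsPerSafepoint(32, 0.70, 4.0) ~=  3.5 ms  (four per second)
    // maxStoppedMsPerSafepoint(64, 0.70, 1.0) ~=  6.8 ms  (same pause hurts twice as much on 64 vcores)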

Dropping your safepointing frequency even at CPU-saturated load points (by coding to reduce GC pressure or by increasing GC efficiency in some way) is the obvious way to combat this. But TTSP (time to safepoint) deserves a special mention here. Note that at those levels, TTSP is often as important as the actual work done in the safepoint, as the time to safepoint can often dominate PrintGCApplicationStoppedTime output. This is one of the reasons we started working on TTSP early in our history at Azul, even before we started focusing on low latency apps and well before we moved to x86 - when you have an actual 400+ core box, it takes less than 0.1% (and less than 1 msec of TTSP per second) to trigger a conspicuous level of inability-to-use-your-CPUs. With concurrent collectors, TTSP tends to dominate the serialization imposed by the JVM, which is why we pay a LOT of attention to keeping it low in Zing.