KVM impact on long tail latency


Ariel Weisberg

Oct 7, 2014, 5:17:51 PM10/7/14
to mechanica...@googlegroups.com
I am working with a requirement to keep latency at P99.999 under 30 milliseconds over several minute time scales under RHEL 6.3 and KVM. This is in an environment where currently there is only a single guest VM although I am told there are plans to run a second VM for client applications.

After doing a few tests it's pretty clear that the performance after P99 is degraded and once you start clustering and synchronously replicating the wheels come off. Right now I am measuring an in-memory database without persistence enabled so disk IO is not a factor.

From an application perspective I have seen this manifest in diverse ways. Time to safepoint is randomly long, young gen GCs are generally long with some outliers that are extra long without a corresponding increase in live objects, native calls to read/write sockets or invocations of Selector.wakeup() randomly take a long time, and canary threads occasionally don't get to run on schedule. Sometimes the long native calls are obviously delayed reentry due to GC, but sometimes not.

To give some perspective, here are jHiccup measurements for a dummy app that does nothing, on otherwise idle systems.

jHiccup on my desktop gives 
0.13 milliseconds at 99.9 
0.44 milliseconds at 99.99 
5.87 milliseconds at 99.999 

On one of the VMs 
10.03 milliseconds at 99.9 
11.08 milliseconds at 99.99 
11.27 milliseconds at 99.999 

On a bare metal server
0.16 milliseconds at 99.9
0.41 milliseconds at 99.99
0.77 milliseconds at 99.999

Under KVM, a very basic benchmark with 10 concurrent threads has a P99.999 of 20 milliseconds on a single node (client and server on the same node). With a 3-node cluster the latency jumps to 40+ milliseconds at five 9s. This is running on Sandy Bridge with 16 cores, one of which is not presented to the VM.

By way of comparison, the same benchmark running on a bare metal server (Sandy Bridge, 12 cores) has a max latency of 6 milliseconds. Running on my desktop the P99.999 is 6 milliseconds and the max is 9. We run the clustered version of a similar benchmark in continuous integration and the P99.999 is 5.76 milliseconds.

I have found recommendations for things like pinning virtual cores to physical cores and presenting cores to the guest as their own NUMA nodes so that the guest will split locks across them. Has anyone tried these tricks, or others, to get decent behavior out of KVM?

Todd Lipcon

Oct 7, 2014, 5:22:45 PM10/7/14
to mechanica...@googlegroups.com
Hey Ariel,

It's not KVM-specific, but have you checked that the machine isn't entering higher C-states? It can take a while to wake up from a higher C-state in many cases.
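One quick way to check, assuming the standard Linux cpuidle sysfs layout (exact paths and states vary by kernel and driver):

```shell
# List the idle states the driver exposes for CPU 0:
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
# Worst-case exit latency (microseconds) for each of those states --
# this is roughly the wake-up cost Todd is describing:
cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
# Which idle driver is in use (e.g. intel_idle vs acpi_idle):
cat /sys/devices/system/cpu/cpuidle/current_driver
```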


-Todd

--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ariel Weisberg

Oct 7, 2014, 5:32:33 PM10/7/14
to mechanica...@googlegroups.com, to...@lipcon.org
Thanks Todd, I completely forgot about power management. I am told that cpuspeed is turned off on the host and guests. I will get access to the host I was testing on tomorrow.

I have seen power management wreck throughput at low concurrency in the past, but I don't know what it does to long tail latency. Does the CPU really become unavailable during transitions?

Would the best way to fix power management be to change the governors to performance?
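I.e. something like this, assuming the cpufreq sysfs interface (untested on our hosts; requires root):

```shell
# Force the performance governor on every CPU. Note the governor only
# controls frequency scaling (P-states); deep idle (C-state) wake-ups
# are a separate knob controlled by the idle driver.
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done
# Verify:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
```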


Gil Tene

Oct 8, 2014, 10:34:06 PM10/8/14
to mechanica...@googlegroups.com
I assume those core counts (16 cores and 12 cores) are with HT off. I'd turn HT on to give you (and KVM, and the guest) many more run queues to play with.

When you quote your "bare metal" numbers, I take it those are on a different machine. An interesting exercise would be to run the exact same test on the KVM host instead of the KVM guest (with no guest booted during the test), and compare hiccup behaviors. That will help determine if it's the host setup that is at fault, or the KVM/guest config.

For the KVM setup, I would highly recommend you go the "unpopular route" and disable THP (transparent huge pages) on the host. KVM setups really like THP, and THP was actually built with KVM in mind: it feeds KVM nice 2MB pages to back the guest memory with, and makes everyone happy from an efficiency and trap-rate point of view. Unfortunately, THP also makes random victim processes occasionally stall for large amounts of time when they ask for memory, forcing the victim to defrag an entire zone of memory before it can keep going. That tradeoff isn't worth it if you are interested in good tail numbers.

Instead, you can use statically provisioned hugetlb pages, allocated at boot time, and make KVM back your guest memory with those. This also has the added benefit of preventing the host from paging or swapping your guest's memory (because Linux simply lacks the capability to break huge pages up into pieces and page them out, so they are effectively mlocked).
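A rough sketch of that recipe (page counts here are illustrative, not tuned; note RHEL 6 exposes THP under a vendor-specific sysfs path):

```shell
# Disable THP at runtime. On upstream kernels:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# (On RHEL 6 the path is /sys/kernel/mm/redhat_transparent_hugepage/.)
#
# Reserve static 2MB hugetlb pages at boot by adding to the kernel
# command line, e.g. 8192 x 2MB = 16GB:
#   hugepages=8192 transparent_hugepage=never
#
# Then have libvirt back the guest with them, in the domain XML:
#   <memoryBacking><hugepages/></memoryBacking>
```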

For power management, the best (as in most reliable) way to turn off power management is to turn off the intel idle driver on the KVM host. Do this by adding intel_idle.max_cstate=0 to the kernel arguments in /etc/default/grub .
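As a config sketch, keeping in mind that RHEL 6 ships legacy grub rather than grub2, so the kernel arguments live in a different file there:

```shell
# On grub2 systems: append to GRUB_CMDLINE_LINUX in /etc/default/grub,
# then regenerate the grub config. On RHEL 6 (grub 0.97): append
# directly to the kernel line in /boot/grub/grub.conf. Either way,
# the argument to add is:
#   intel_idle.max_cstate=0
# which keeps the intel_idle driver from putting cores into deep
# C-states with long wake-up latencies.
```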

Ariel Weisberg

Oct 17, 2014, 2:59:25 PM10/17/14
to mechanica...@googlegroups.com
Hi,

Hyper-threading is enabled and I was just giving the actual core counts.

I have tried disabling THP in host and guest and no dice. Haven't tried completely disabling power management via the kernel boot options yet.

This is what bare metal looks like http://arielweisberg.com/bare-metal.html and this is what KVM looks like http://arielweisberg.com/kvm.html. These charts are the latency at 99.999 for every second.

You can probably ignore the noise at the end of the bare-metal run. There is something environmental running I will need to track down.

I checked the two top spikes in the KVM workload and they are absolutely huge. Flight recorder reports no lock contention and no long garbage collections. I don't have access to the output of -XX:+PrintGCApplicationStoppedTime yet because the harness isn't collecting standard out.

What I can see is that the threads that do the work on a single node are parked on their task queues for the duration of the spike. The client process and the other servers don't show anything that would indicate they aren't making progress.

I would suspect the database more except it produces the desired numbers on bare-metal.

I will make sure to test power management next.

-Ariel

Gil Tene

Oct 17, 2014, 4:45:09 PM10/17/14
to mechanica...@googlegroups.com
Can you collect jHiccup logs along with these and post them? Use the -c flag so you'll get a log for the JVM and a matching control log from an idle process on the same system. It would be very useful (from a triage point of view) to see whether any of those spikes show up in the JVM or guest OS jHiccup logs. BTW, later versions of the jHiccup log let you extract (with jHiccupLogProcessor) the full histogram for any time range.
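For reference, such a run looks roughly like this (the jar path and application are placeholders, and the log-processor option names are from memory, so check them against its help output):

```shell
# Attach jHiccup as a javaagent; -i sets the reporting interval in ms,
# and -c launches a matching idle "control" JVM on the same system:
java -javaagent:jHiccup.jar="-i 1000 -c" -jar yourapp.jar
# Later, extract the histogram for a chosen time range from the log
# (option names are an assumption; check jHiccupLogProcessor -h):
jHiccupLogProcessor -i hiccup.log -start 300 -end 360
```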

For "extra credit", an idle jHiccup log running on the host, covering the same time range as the guest jHiccup logs, would add one more fork to the triage steps. I.e. it can help determine "host ok, problem only seen inside the guest".

Also, there is a clear band of things happening at "exactly" 40 msec on the guest that does not show up on bare metal. That one is curious on its own. jHiccup would make it easy to see whether this is systemic (to the guest), local to the JVM (running on the guest), or neither. If it's neither, this could all be I/O latency behavior on the guest that shows up as application-level latency in your database, without any associated glitches at the JVM or guest OS level.

-- Gil. 

--
You received this message because you are subscribed to a topic in the Google Groups "mechanical-sympathy" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/mechanical-sympathy/PDEFu3Y3Tcg/unsubscribe.
To unsubscribe from this group and all its topics, send an email to mechanical-symp...@googlegroups.com.

Ariel Weisberg

Oct 20, 2014, 4:45:02 PM10/20/14
to
Hi,

I still owe you some jHiccup output. I am working with someone else to get the testing done so there is some latency.

On a system with 24 threads he reduced the number of threads in the guest to 16, 8 on each socket, and the results were much better http://arielweisberg.com/reducedvcpus.html. There was no specific vCPU pinning.

The config that was changed was 
<vcpu placement='static' cpuset='0-7,12-19'>16</vcpu>
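For completeness, a sketch of the explicit 1:1 pinning variant we haven't tried yet (domain name and CPU numbering are illustrative):

```shell
# Pin each vCPU to its own physical CPU on a live domain:
virsh vcpupin mydomain 0 0
virsh vcpupin mydomain 1 1
# ...one line per vCPU. The persistent equivalent goes in the
# domain XML:
#   <cputune>
#     <vcpupin vcpu='0' cpuset='0'/>
#     <vcpupin vcpu='1' cpuset='1'/>
#   </cputune>
```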

We still need to run some other combinations to home in on what is the right strategy, but this is good enough for us.

-Ariel
