I am working with a requirement to keep P99.999 latency under 30 milliseconds, measured over windows of several minutes, on RHEL 6.3 under KVM. Currently the environment runs only a single guest VM, although I am told there are plans to run a second VM for client applications.
After doing a few tests it's pretty clear that performance beyond P99 is degraded, and once you start clustering and synchronously replicating the wheels come off. Right now I am measuring an in-memory database with persistence disabled, so disk IO is not a factor.
From an application perspective I have seen this manifest in diverse ways. Time to safepoint is randomly long; young-gen GCs are generally long, with some outliers that are extra long without a corresponding increase in live objects; native calls to read/write sockets or invocations of Selector.wakeup() randomly take a long time; canary threads occasionally don't get to run on schedule. Sometimes the long native calls are obviously delayed reentry due to GC, but sometimes not.
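For reference, the canary-thread check I mention is roughly this kind of thing (a minimal sketch, not our actual harness; the class and method names are mine). It sleeps on a fixed interval and records how late each wakeup is relative to when it was due, so large jitter means the thread did not get to run on schedule:

```java
// CanaryThread: hypothetical minimal scheduling canary.
public class CanaryThread {

    /**
     * Sleep in intervalMs steps for `iterations` rounds and return the worst
     * observed wakeup lateness (beyond the requested sleep) in nanoseconds.
     */
    public static long measureMaxJitterNs(int iterations, long intervalMs)
            throws InterruptedException {
        long maxJitterNs = 0;
        long now = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            long due = now + intervalMs * 1_000_000L; // when we expect to wake
            Thread.sleep(intervalMs);
            now = System.nanoTime();
            long jitterNs = now - due; // lateness beyond the requested sleep
            if (jitterNs > maxJitterNs) {
                maxJitterNs = jitterNs;
            }
        }
        return maxJitterNs;
    }

    public static void main(String[] args) throws InterruptedException {
        long worst = measureMaxJitterNs(1000, 1);
        System.out.printf("max wakeup jitter: %.3f ms%n", worst / 1_000_000.0);
    }
}
```

On an idle bare-metal box the worst jitter stays small; on the VM it spikes in the same way the jHiccup numbers below would suggest.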
To give some perspective, here are measurements from running jHiccup against a dummy app that does nothing, on an idle system.
jHiccup on my desktop gives:
0.13 milliseconds at 99.9
0.44 milliseconds at 99.99
5.87 milliseconds at 99.999

On one of the VMs:
10.03 milliseconds at 99.9
11.08 milliseconds at 99.99
11.27 milliseconds at 99.999

On a bare metal server:
0.16 milliseconds at 99.9
0.41 milliseconds at 99.99
0.77 milliseconds at 99.999
Under KVM, a very basic benchmark with 10 concurrent threads has a P99.999 of 20 milliseconds on a single node (client and server on the same node). With a 3-node cluster the latency jumps to 40+ milliseconds at five 9s. This is running on Sandy Bridge with 16 cores, with one core not presented to the VM.
By way of comparison, that same benchmark running on a bare metal server (Sandy Bridge, 12 cores) has a max latency of 6 milliseconds. Running on my desktop the P99.999 is 6 milliseconds and the max is 9. We run the clustered version of a similar benchmark in continuous integration and the P99.999 is 5.76 milliseconds.
I have found recommendations for things like pinning virtual cores to physical cores, and presenting each core to the guest as its own NUMA node so the guest splits its locks per node. Has anyone tried these tricks, or others, to get decent behavior out of KVM?
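For concreteness, the pinning variant I keep seeing recommended looks something like this in the libvirt domain XML (a sketch only; the vCPU count and host core numbers are made up, and I have not yet verified that it helps):

```xml
<!-- Hypothetical <cputune> fragment: pin each vCPU to its own host core and
     keep the emulator/IO threads off those cores (core 0 reserved for the host). -->
<domain type='kvm'>
  <vcpu placement='static'>4</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='1'/>
    <vcpupin vcpu='1' cpuset='2'/>
    <vcpupin vcpu='2' cpuset='3'/>
    <vcpupin vcpu='3' cpuset='4'/>
    <emulatorpin cpuset='0'/>
  </cputune>
</domain>
```

The same pinning can be applied to a running guest with `virsh vcpupin <domain> <vcpu> <cpulist>` if you want to experiment without editing the XML.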