Multiple JVMs Performance Degradation


Chase

Feb 5, 2013, 10:16:51 PM2/5/13
to mechanica...@googlegroups.com
Hello All - I recently encountered an interesting test result running multiple JVMs on a single VMware guest.  I thought it might be good to post it to the group to see if people have ideas about what could be going on.  The architecture is pretty simple:

Hardware: dual-socket E5-2680 with VMware ESXi 5.1
- 2 VMs with 6 cores each and 20 GB of memory allocated
- Each VM runs 2 JVMs with 2 GB heaps
- A 3rd VM runs an in-memory DB with 4 cores and most of the remaining memory

What was interesting: if I killed 1 of the JVMs, the remaining JVM actually processed more TPS than the 2 JVMs together, and used about the same CPU as the original single JVM (not 2x the CPU).  I realize that VMs are not ideal for high-performance environments, but they are quite common, so this could be a common issue.  I believe the original architecture operated on the assumption of scaling out to many small JVMs with small heaps to reduce pause time.  However, these results seem to indicate that 1 larger JVM is much more efficient than many smaller JVMs.  I am going to run some more tests, but I thought this group may have good ideas on why there is such a significant delta between 1 vs 2 JVMs.

Thanks,
Chase

Michał Warecki

Feb 6, 2013, 3:01:31 AM2/6/13
to Chase, mechanica...@googlegroups.com
Hi,

I think it all depends on the application you are running. If the application is CPU bound and uses all cores at the same time, then doubling the number of JVMs will double the number of threads and increase the number of context switches. More threads != better performance.
However, more VMs may improve performance.
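To make the oversubscription point concrete, here is a toy benchmark of my own (a sketch, not anything from Chase's actual setup) that runs the same fixed amount of CPU-bound work under increasing thread counts; once the thread count passes the core count, wall time should stop improving and may regress due to context switching:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class OversubscribeDemo {
    // Busy-work loop: pure arithmetic so the thread stays CPU-bound.
    static long burn(long iters) {
        long x = 1;
        for (long i = 0; i < iters; i++) x = x * 31 + i;
        return x;
    }

    // Split a fixed amount of total work across N threads and time it.
    static long timeMillis(int threads, long totalIters) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final long perThread = totalIters / threads;
        long start = System.nanoTime();
        Future<?>[] fs = new Future<?>[threads];
        for (int i = 0; i < threads; i++) fs[i] = pool.submit(() -> burn(perThread));
        for (Future<?> f : fs) f.get();
        pool.shutdown();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        int cores = Runtime.getRuntime().availableProcessors();
        long work = 200_000_000L;
        // Same total work at 1 thread, one thread per core, and 4x oversubscribed.
        for (int t : new int[]{1, cores, cores * 4}) {
            System.out.println(t + " threads: " + timeMillis(t, work) + " ms");
        }
    }
}
```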

I'm not an expert so I could be wrong :-)

Cheers,
Michał


--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Peter Lawrey

Feb 6, 2013, 4:56:55 AM2/6/13
to Chase, mechanica...@googlegroups.com
In my experience, it is very easy to over-utilise a shared resource when you have multiple threads and processes.  When this happens you get little or no performance gain, and possibly a slowdown due to contention.  I would expect that when you get these surprising slowdowns, it is due to an over-utilised resource.  What do these VMs share?  Do they share sockets, memory banks, disk drives, a network adapter?



Kirk Pepperdine

Feb 6, 2013, 11:27:21 AM2/6/13
to Chase, mechanica...@googlegroups.com
This problem could be anywhere but it suggests that you have a common dependency that is external to the JVMs.

Regards,
Kirk

Gil Tene

Feb 6, 2013, 11:28:57 AM2/6/13
to mechanica...@googlegroups.com
As others note, this may be a simple case of CPU contention. Two JVMs sharing the same saturated CPUs will often perform [in throughput metrics] worse than one.

While VMware adds all sorts of variables, this MAY not have anything to do with VMware. Do you have any reason to believe that a single JVM running your workload is not able to saturate 6 cores? Have you tried this with only a single guest? I.e. compare running one or two JVMs on one of those 6-core guests with 20GB, with absolutely nothing else running on the VMware host (only one booted guest). If you still find the same behavior, my guess would be that you would find it on a physical (non-virtualized) machine too, if it had only 6 or 8 cores...

However, focusing on TPS [alone] as a metric may be one of the issues. You noted that the original architecture probably chose to use multiple smaller JVMs to reduce pause times. Reducing or controlling pause times is usually not motivated by trying to maximize TPS or throughput metrics. It's usually motivated by wanting to control worst-case and high-percentile response time behavior. Measuring TPS alone will not tell you whether or not that original choice is still needed. It may be that the choice was actually to sacrifice peak TPS for a more contained response-time outlier level. To figure this out, you may want to compare the distribution of response times or transaction times between the one- and two-JVM setups, and if one JVM has a better response time profile than two [for whatever "better" means in your business case], you've probably established that there is no need for 2 JVMs. [jHiccup is a useful means for getting a feel for how those distributions behave without actually putting work into measuring them at the application level.]

A word of caution: before you draw conclusions about response time behavior from lab tests, make sure your tests last long enough to see the worst thing that pauses will do to you. E.g. look in production logs for peak pause behavior, and don't believe your test results until they include similar pause lengths.
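One way to act on this suggestion (a hypothetical sketch with made-up numbers, purely for illustration; jHiccup does this far more rigorously, with coordinated-omission correction) is to record per-transaction times and compare high percentiles rather than TPS:

```java
import java.util.Arrays;

public class PercentileCompare {
    // Given recorded transaction times (ms, pre-sorted ascending), report a
    // percentile -- the tail, not the mean, is what small JVMs were meant to protect.
    static double percentile(long[] sortedMs, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sortedMs.length) - 1;
        return sortedMs[Math.max(0, Math.min(idx, sortedMs.length - 1))];
    }

    static void report(String label, long[] ms) {
        long[] sorted = ms.clone();
        Arrays.sort(sorted);
        System.out.printf("%s: p50=%.0f p99=%.0f p99.9=%.0f max=%d%n",
                label, percentile(sorted, 50), percentile(sorted, 99),
                percentile(sorted, 99.9), sorted[sorted.length - 1]);
    }

    public static void main(String[] args) {
        // Invented sample data: one big JVM might win on p50 yet lose badly
        // at the max if its GC pauses on the larger heap are longer.
        long[] oneJvm = {4, 5, 5, 6, 6, 7, 8, 9, 12, 250};
        long[] twoJvms = {6, 7, 7, 8, 8, 9, 10, 11, 14, 60};
        report("1 JVM ", oneJvm);
        report("2 JVMs", twoJvms);
    }
}
```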

Chase

Feb 6, 2013, 11:55:37 AM2/6/13
to mechanica...@googlegroups.com
All good points - I am running some more tests over multiple hours.  It looks like a CPU contention issue at this point.

Hussam Mousa

Feb 6, 2013, 12:46:59 PM2/6/13
to Chase, mechanica...@googlegroups.com
An important consideration I didn't see mentioned is memory and cache contention (I suppose the latter can be bundled under CPU contention from 25000 feet above).
Odds are that if you are running exact replicas (JVM + app stack), the linear addresses in both processes are the same, which means they are hitting the same cache lines and completely thrashing each other (especially true if you are using Hyper-Threading - which you probably are, and should be).

More likely than not, you are memory/cache bound rather than CPU bound. Also check your JVM structure sizes (perm, stack, code cache, ...). Another likely situation is tremendous amounts of paging. You can quickly check your paging behavior for 1 vs 2 JVMs using vmstat.
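As a complement to vmstat, the JVM's own memory pools can be listed from inside the process to see its footprint beyond the 2 GB heap (a generic sketch using the standard java.lang.management API, not specific to this setup):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class JvmFootprint {
    // Print the committed size of each JVM memory pool (heap generations,
    // perm gen / metaspace, code cache, ...) -- the process's real memory
    // demand, which is what matters for paging, is larger than -Xmx alone.
    public static void main(String[] args) {
        long totalCommitted = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage u = pool.getUsage();
            if (u == null) continue; // pool may be invalid
            totalCommitted += u.getCommitted();
            System.out.printf("%-28s %,14d bytes committed%n",
                    pool.getName(), u.getCommitted());
        }
        System.out.printf("%-28s %,14d bytes committed%n", "TOTAL", totalCommitted);
    }
}
```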

An experiment worth considering is sizing a VM just enough to saturate a single JVM, and then scaling your system by cloning VMs. So instead of 4 VMs x 2 JVMs each, do 8 VMs x 1 JVM each. This should help resolve some of your cache contention and paging issues.



Gil Tene

Feb 6, 2013, 1:36:33 PM2/6/13
to mechanica...@googlegroups.com, Chase


On Wednesday, February 6, 2013 9:46:59 AM UTC-8, Hussam Mousa wrote:
An important consideration I didn't see mentioned is memory and cache contention (I suppose the latter can be bundled under CPU contention from 25000 feet above).

Yup. Waste due to TLB and cache behavior is the likely candidate in these cases (it's usually not memory bandwidth). CPU contention (as seen by top, for example) includes pretty much anything that makes your process occupy the CPU time slot from an OS point of view. When you see a drop in performance on a saturated system (100% busy CPU), it usually means that you are getting less work done per CPU time quantum, and that's usually because of some sort of efficiency drop.

The most likely cause of inefficiency in such saturated cases is the cost of context switching between processes and address spaces. Threads within the same Java process share the same virtual-to-physical mappings, and tend to have good constructive sharing of cache at the L2 and L3 levels. Each time you context switch between processes, your CPU's TLBs are flushed, your L1 cache (and likely the per-core 256KB L2) contents are useless for the incoming process, and your L3 cache footprint is competing with the L3 cache footprint of the other process (and they both live inside the same 20MB shared L3). So TLB misses, L2 misses, and L3 misses are likely higher.
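A rough way to see these cache-level cliffs is to time strided traversals of increasingly large arrays (a toy sketch of my own; the sizes are chosen to hypothetically straddle per-core L2 and a shared L3 like the E5-2680's). Once the working set exceeds a cache level, nanoseconds per access jump; a competing process effectively halves the shared L3 and moves that cliff closer:

```java
public class CacheFootprintDemo {
    // Walk the array touching roughly one 64-byte cache line per step
    // (16 ints x 4 bytes), so the timing tracks cache misses, not ALU work.
    static long traverse(int[] a, int rounds) {
        long sum = 0;
        for (int r = 0; r < rounds; r++)
            for (int i = 0; i < a.length; i += 16)
                sum += a[i];
        return sum;
    }

    public static void main(String[] args) {
        // Working sets from comfortably-in-L2 up to well past a 20MB L3.
        for (int kb : new int[]{64, 256, 4096, 65536}) {
            int[] a = new int[kb * 1024 / 4];
            traverse(a, 2); // warm-up pass, lets the JIT compile the loop
            int accesses = (a.length / 16) * 4;
            long t0 = System.nanoTime();
            traverse(a, 4);
            long ns = System.nanoTime() - t0;
            System.out.printf("%6d KB working set: %.2f ns/access%n",
                    kb, (double) ns / accesses);
        }
    }
}
```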
 
 
Odds are, that if you are running exact replicas (JVM + APP stack) the linear addresses in both processes are the same which will mean they are hitting the same cache lines and completely thrashing each other (especially true if you are using Hyper Threading - which you probably are, and should be)
 
Umm.. that part doesn't work that way. Identically laid out processes do not hurt. The address layout of two competing processes has no practical effect on how cache contention between them works, in the sense that they are no more or less likely to compete for specific cache lines. There can be cache contention, for sure, but having different memory layouts (the same amount of stuff but at different addresses) will not make any difference. The caches are physically addressed, and since two identically laid out processes will still occupy different physical pages, you'll see roughly the same statistical spread of cache line indexes regardless of virtual address.

Kirk Pepperdine

Feb 7, 2013, 2:04:07 AM2/7/13
to Gil Tene, mechanica...@googlegroups.com, Chase
and the answer doesn't even have to be this complex if you know the external dependencies and how the app competes for them.

Kirk Pepperdine

Feb 7, 2013, 2:05:04 AM2/7/13
to Gil Tene, mechanica...@googlegroups.com, Chase
Hit send too soon. What happens if you run the app on different bits of hardware and/or on different VMs?

