On Wednesday, February 6, 2013 9:46:59 AM UTC-8, Hussam Mousa wrote:
An important consideration I didn't see mentioned is memory and cache contention (I suppose the latter can be bundled under CPU contention from 25000 feet above).
Yup. Waste due to TLB and cache behavior is the likely candidate in these cases (it's usually not memory bandwidth). CPU contention (as seen by top, for example) includes pretty much anything that makes your process occupy a cpu time slot from an OS point of view. When you see a drop in performance on a saturated system (100% busy cpu), it usually means that you are getting less work done per cpu time quantum, and that's usually because of some sort of efficiency drop.
The most likely cause of inefficiency in such saturated cases is the cost of context switching between processes and address spaces. Threads within the same Java process share the same virtual-to-physical mappings, and tend to have good constructive sharing of cache at the L2 and L3 levels. Each time you context switch between processes, your CPU's TLBs are flushed, your L1 cache (and likely the per-core 256KB L2) contents are useless for the incoming process, and your L3 cache footprint is competing with the L3 cache footprint of the other process (and they both live inside the same 20MB shared L3). So TLB misses, L2 misses, and L3 misses are all likely to be higher.
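To put a rough number on the process-switch cost, here's a minimal ping-pong sketch (my own illustration, not from the thread): two processes bounce a byte over a pair of pipes, so each round trip forces at least two context switches. This assumes Linux/Unix (`os.fork`); pin both processes to one core (e.g. with taskset) for a cleaner number, since otherwise they may run concurrently on two cores.

```python
import os
import time

N = 20000  # round trips

# Two pipes: parent -> child and child -> parent.
r1, w1 = os.pipe()
r2, w2 = os.pipe()

pid = os.fork()
if pid == 0:
    # Child: echo each byte straight back.
    for _ in range(N):
        os.read(r1, 1)
        os.write(w2, b"x")
    os._exit(0)

start = time.perf_counter()
for _ in range(N):
    os.write(w1, b"x")   # wake the child...
    os.read(r2, 1)       # ...and block until it answers
elapsed = time.perf_counter() - start
os.waitpid(pid, 0)

# Each round trip costs at least two switches when both
# processes share a core.
print("~%.2f us per switch (rough upper bound)" % (elapsed / (2 * N) * 1e6))
```

The same ping-pong between two threads in one address space tends to come out cheaper, because the TLB flush and address-space change go away; that difference is the per-switch waste being described above.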
Odds are that if you are running exact replicas (JVM + APP stack), the linear addresses in both processes are the same, which will mean they are hitting the same cache lines and completely thrashing each other (especially true if you are using Hyper-Threading, which you probably are, and should be).
Umm... that part doesn't work that way. Identically laid out processes do not hurt each other. The address layout of two competing processes has no practical effect on how cache contention between them works, in the sense that they are no more or less likely to compete for specific cache lines. There can be cache contention, for sure, but having different memory layouts (the same amount of stuff at different addresses) will not make any difference. The caches are physically addressed, and since two identically laid out processes will still occupy different physical pages, you'll see roughly the same statistical spread of cache line indexes regardless of virtual address.
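A back-of-the-envelope sketch of that last point (my own numbers: assuming the 20MB, 20-way shared L3 mentioned above, 64-byte lines, 4KB pages, and two made-up physical frame numbers): the cache set is picked from bits of the physical address, so the same virtual address in two processes lands in different sets once the backing physical frames differ.

```python
LINE = 64                      # cache line size in bytes
WAYS = 20                      # associativity (assumed)
SIZE = 20 * 1024 * 1024        # 20MB shared L3
SETS = SIZE // (LINE * WAYS)   # 16384 sets -> index uses phys bits 6..19
PAGE = 4096

def set_index(phys_addr):
    # Physically indexed: the set is chosen from physical address bits.
    return (phys_addr // LINE) % SETS

# The same virtual address in two identically laid out processes...
va = 0x7F0000001040
offset = va % PAGE             # the page offset is the only part they share

# ...backed by different (made-up, illustrative) physical frames:
phys_a = 0x1234 * PAGE + offset
phys_b = 0x9ABC * PAGE + offset

print(hex(set_index(phys_a)), hex(set_index(phys_b)))  # prints 0xd01 0x2f01
```

Whether two lines collide in a set is then effectively a coin flip over which frames the OS happened to hand each process, which is the "same statistical spread" point: the virtual layout doesn't enter into it.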