JVMs on single cores but parallel JVMs.


Kevin Burton
Sep 17, 2013, 4:39:48 PM
to mechanica...@googlegroups.com
We've started moving to using one JVM per core.  It wastes a *bit* of memory, but in situations where the JVM has locks we can't mitigate (because we don't control the code), it prevents those contended locks from stalling threads running on other CPUs.

This brings up the issue of the JVM making silly decisions about parallelizing GC.  If you only *have* one core, a parallel GC is silly.

So -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1 ... on G1 would be needed.  Using multiple GC threads would just be a waste of time, as it would only slow you down.
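As a sketch, a per-core launch line might combine those GC flags with CPU pinning on Linux (the core number, heap size, and jar name here are assumptions, not from the thread):

```shell
# One JVM instance pinned to core 3, with GC forced down to a single
# thread so the collector can't fan out beyond that core.
CORE=3
GC_FLAGS="-XX:+UseG1GC -XX:ParallelGCThreads=1 -XX:ConcGCThreads=1"
LAUNCH="taskset -c $CORE java $GC_FLAGS -Xmx512m -jar app.jar"
echo "$LAUNCH"
```

`taskset -c` restricts the whole process (all of its threads) to the listed cores; one such line per core gives the one-JVM-per-core layout.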

Has anyone else here taken this strategy?

Kevin

Chris Vest
Sep 17, 2013, 4:46:20 PM
to mechanica...@googlegroups.com
I recall something about a particular single-threaded mode, with a single-threaded GC, that the JVM can be in when you restrict it to a single core.

I don't remember any specifics about this though. Could be worth looking into.

(Sent from a phone)

Chris
--
You received this message because you are subscribed to the Google Groups "mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email to mechanical-symp...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Peter Lawrey
Sep 17, 2013, 5:56:31 PM
to mechanica...@googlegroups.com

I have one or two critical threads per JVM on an isolated core, but allow all the other threads, including GC, to run on junk cores. This means the critical thread doesn't compete for CPU with the other threads in the same JVM (or other JVMs).

I wouldn't suggest limiting the entire JVM to one core, as it means you have to context-switch your most critical thread in order to run your least important one.
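A rough sketch of that layout on Linux (the core numbers and jar name are hypothetical; assumes the critical core was reserved, e.g. with `isolcpus`, at boot):

```shell
# Confine the whole JVM -- GC, JIT, and other housekeeping threads --
# to the "junk" cores 0 and 1 at launch.
JUNK_CORES="0,1"
LAUNCH="taskset -c $JUNK_CORES java -jar app.jar"
echo "$LAUNCH"
# The one or two critical threads would then re-pin themselves from
# inside the process onto the isolated core (e.g. via
# sched_setaffinity through JNI, or a thread-affinity library).
```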

--

Kevin Burton
Sep 17, 2013, 5:58:46 PM
to mechanica...@googlegroups.com
You mean -XX:+UseSerialGC?

I think I remember reading somewhere that it was eventually going to be removed.

Also, I imagine there are other optimizations in G1 that aren't tied to multiple cores, so using G1 with a parallelism of 1 might make more sense.

Jason Koch
Sep 17, 2013, 7:09:51 PM
to mechanica...@googlegroups.com
I've never used it, but CMS has an incremental mode that's supposed to be helpful when running on single cores (if Serial's stop-the-world pauses are too long).  It basically requests that the CMS thread yield periodically.  I would assume your initial mark and remark phases are still stop-the-world.
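For reference, incremental CMS was enabled with a pair of (now long-deprecated) HotSpot flags; this is just the flag spelling, not a recommendation:

```shell
# Historical JDK 6/7-era flags: CMS with incremental mode, which asks
# the concurrent CMS thread to yield in small steps (deprecated in
# JDK 8 and later removed).
CMS_INC_FLAGS="-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
echo java $CMS_INC_FLAGS -jar app.jar
```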

Martin Thompson
Sep 18, 2013, 3:27:07 AM
to mechanica...@googlegroups.com

Kirk Pepperdine
Sep 18, 2013, 7:26:38 AM
to mechanica...@googlegroups.com
I complained bitterly about this months ago, and it was like pissing into the wind.... :-( They first wanted it out of 8, but I complained loudly enough that they decided to deprecate it in 8 and work towards removal in 9. The issue is apparently that the number of test cycles it takes is fairly large.

I've one more shot left, at J1, where I'll have breakfast with Oracle execs on Sunday morning. It would help if there were any way to get some noise backing me up on getting the deprecation decision reversed.

-- Kirk


Martin Thompson
Sep 18, 2013, 9:07:49 AM
to mechanica...@googlegroups.com
I think that if the powers that be get their way then CMS as a whole will be deprecated in favour of G1.

Kirk Pepperdine
Sep 18, 2013, 9:15:06 AM
to mechanica...@googlegroups.com

> I think that if the powers that be get their way then CMS as a whole will be deprecated in favour of G1.

G1 in _40 is better than it has been, and I've been told that the next releases are better. Just last week I was finally able to get better performance out of a benchmark using default G1 settings than I could get out of a tuned CMS run. I've also found solutions to problems I've had in the past trying to get Eden to adapt to a small enough size that the evacuation pause wouldn't blow the max pause time we could tolerate. That said, you really need to know a *lot* about the collector to get really good results, so while things are getting better with G1, I still don't believe it approaches C4 in many respects.

-- kirk



Martijn Verburg
Sep 19, 2013, 11:17:19 AM
to mechanica...@googlegroups.com
A frustrating aspect for me is that G1 was oversold when they first stated it was ready for production. I agree with Kirk that 7u40 is looking much better (anecdotally from a few test runs on PCGen, Eclipse, Censum and other desktop tools), but I still wouldn't be happy with CMS being replaced outright until there's more proof (yeah I know, we're supposed to be one of the companies looking into that ;p).

Cheers,
Martijn

Rüdiger Möller
Sep 21, 2013, 9:46:14 AM
to mechanica...@googlegroups.com
G1 also has issues with BLOBs (large arrays of primitive types).

Gil Tene
Sep 21, 2013, 11:39:33 AM
to mechanica...@googlegroups.com
Back to the original topic, running enough JVMs such that there is only 1 core per JVM is not a good idea unless you can accept regular multi-tens-of-msec pauses even when no GC is occurring in your JVM. I'd recommend sizing for *at least* 2-3 cores per JVM unless you find those sort of glitches acceptable.

The reasoning is:

[assuming you are not isolating JVMs to dedicated cores that have nothing else running on them, which has its own obvious problems]

GC:
even if you limit GC to using one thread, that one GC thread can run concurrently with your actual application threads for long periods of time (e.g. during marking and sweeping in CMS, or during G1 marking). If there is only one core per JVM, then whenever any one JVM is active in GC, at least one other JVM's application threads will have entire scheduling quanta stolen from them.

Before people start thinking "this will be rare", let me point out that with many JVMs, some GC is more likely to be active at any given time. E.g. if you ran 12 JVMs on a 12-vcore machine, and each JVM had a very reasonable 2% duty cycle (not necessarily pause time, but time in a GC cycle compared to time when no GC is active), then there would be some sort of quantum-stealing-from-application-threads GC activity going on roughly 25% of the wall-clock time even if GCs were perfectly interleaved (which they won't be), and if they weren't perfectly interleaved there would be multiples of those going on. Under full load, such a setup will translate into a 98%-99%'ile that is at least as large as an OS scheduling quantum, and under lower loads those quantum-level hiccups will only move to slightly higher percentiles (e.g. even at only 5%-10% load your 99.9% will still be around 10 msec).
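The arithmetic behind that estimate can be checked quickly (treating GC windows as independent across JVMs is an assumption):

```shell
# 12 JVMs, each in a GC cycle 2% of the time. Perfectly interleaved,
# some GC is active 12 * 2% = 24% of wall-clock time; with independent
# windows it is 1 - 0.98^12, about 21.5% -- roughly a quarter either way.
awk 'BEGIN {
  n = 12; duty = 0.02
  printf "interleaved=%.0f%%\n", n * duty * 100
  printf "independent=%.1f%%\n", (1 - (1 - duty) ^ n) * 100
}'
```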

Other JVM stuff:
The JVM has other, non-GC work that it uses JVM threads for. E.g. JIT compilation will cause storms of compiler activity that run concurrently with the app. While GC does tend to dominate over time, limiting GC threads to 1 does not cap the amount of concurrent, non-application-thread work that the JVM does.

Application threads:
Unless your Java application is purely single-threaded, there will be bursts of time where one JVM has multiple runnable application threads active. Whenever those occur with one-core-per-JVM sizing, application threads across JVMs will contend for cores, and scheduling-quantum-sized delays will be incurred.

Bottom line: if you never want to see quantum-sized delays in your apps, you need as many cores in the system as the total possible concurrently runnable threads across all JVMs (app threads plus GC threads). If you are willing to occasionally experience quantum-level delays, you can relax that a bit, but be aware that the higher the load on your system is, the more your cross-JVM effects will mount, and that even at relatively low average loads (5-10%) you will start seeing very frequent glitches in the tens of msec.

Kevin Burton
Sep 21, 2013, 12:30:28 PM
to mechanica...@googlegroups.com
ok... so I'll rephrase this a bit.

You're essentially saying that GC and background threads will need to run to prevent foreground threads from stalling: GC, network IO, background filesystem log flushes, etc.

... and if you're only running on ONE core this will preempt your active threads and will increase latency.

And I guess the theory is that if you have another core free, why not just let that other core step in and help split the load so you can have a smaller "stop the world" interval.

I guess that makes some sense and probably applies to a lot of workloads.

Some points:

- in our load, we are usually about 100% CPU on the current thread, and 100% on the other CPU... so if we trigger GC in the core, the secondary core isn't going to necessarily execute faster.  In fact it might execute slower due to memory locality (depending on the configuration).  I think in most situations, applications are over-provisioned to account for load spikes, so this setup might actually warrant deployment as it would work in practice.

- This idea is partially a distributed-computing fallacy.  GC doesn't scale to hundreds of cores... If you're on a 64-core machine, splitting your VMs so they are smaller, with the entire working set local to that CPU, and segmenting GC to that core, seems to make the most sense.  You would have GC pauses, but they would be 1/Nth (where N = number of cores) of your entire application's GC hit.

- You can still use a CMS approach here where you GC in the background; it's just done on one core with another thread.

- GC isn't infinitely parallel... You aren't going to send part of your heap over the network and do a map/reduce style GC across 1024 servers within a cluster.  Data locality is important.  Keeping the JVMs small and local to the core and having lots of them seems to make a lot of sense.

- the fewer JVMs you have the more JDK lock contention you can have.  Things like SSL are still contended (yuk) ... though JDK 1.7 has definitely improved the situation.

... one issue of course is that OpenJDK doesn't share the permanent generation classes.  So you see like a 128MB hit per JVM.  This works out to about $2 per month per JVM for us so not really the end of the world.

Kevin

Gil Tene
Sep 21, 2013, 2:47:19 PM
to <mechanical-sympathy@googlegroups.com>
Lots of stuff here, so responses inline.

On Sep 21, 2013, at 9:30 AM, Kevin Burton <burto...@gmail.com> wrote:

> ok... so I'll rephrase this a bit.
>
> You're essentially saying that GC and background threads will need to run to prevent foreground threads from stalling. GC, network IO, background filesystem log flushes, etc.
>
> ... and if you're only running on ONE core this will preempt your active threads and will increase latency.
>
> And I guess the theory is that if you have another core free, why not just let that other core step in and help split the load so you can have a smaller "stop the world" interval.


While using more cores can sometimes help cut down on stop-the-world pauses, I wasn't talking about stop-the-world events, just stop-a-thread-from-running events. That's where the biggest cross-JVM disruption is usually seen. One JVM is not going to make another one stop the world, but when it spikes its CPU use for a sustained amount of time, it will steal CPU in units of scheduling quanta (which are on the order of ~10 msec each on most platforms), so a perfectly runnable application thread on a JVM that had no reason to stop may find itself with a 30 msec pause during which it simply had no CPU to run on.
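One crude way to observe those stolen quanta from the outside (a sketch in the spirit of Gil's jHiccup tool; assumes Linux with GNU `date` nanosecond support and fractional `sleep`):

```shell
# Sleep ~1 ms in a loop and report how long each iteration really
# took. On an idle box the gaps stay near 1-2 ms; on a core-saturated
# box, occasional gaps in the tens of ms are exactly the stolen
# scheduling quanta being described.
for i in 1 2 3; do
  t0=$(date +%s%N)
  sleep 0.001
  t1=$(date +%s%N)
  gap_ms=$(( (t1 - t0) / 1000000 ))
  echo "iteration $i: gap_ms=$gap_ms"
done
```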

So this is about work that could be done concurrently with your application running, but isn't, because no empty cores were available for it. Small (per-operation) pieces of work like interrupt or packet handling and controlling I/O are usually short enough to avoid ever seeing an entire ~10 msec quantum stolen, but JIT compilers and the concurrent parts of garbage collectors certainly do. E.g. both CMS and G1 have plenty of long-running concurrent work, where the marker in both scans the entire live set, which can mean seconds of non-stop pointer chasing for one GC thread. That concurrent work can be (and normally is) done by other cores when they are not all taken up by other work, but filling all the cores up turns this concurrent work into stop-other-threads work.

> I guess that makes some sense and probably applies to a lot of workloads.
>
> Some points:
>
> - in our load, we are usually about 100% CPU on the current thread, and 100% on the other CPU... so if we trigger GC in the core, the secondary core isn't going to necessarily execute faster.  In fact it might execute slower due to memory locality (depending on the configuration).  I think in most situations, applications are over-provisioned to account for load spikes so this setup might actually warrant deployment as it would work in practice.

That over-provisioning is there for a reason. If you use the empty space that was kept there for headroom and place lots of JVMs in it, you'll end up over-subscribing, and things will hurt, especially during peaks.

Oversubscription is not a terrible thing from a raw throughput perspective, as long as responsiveness and latency profiles are not important, as in batch apps. E.g. a Hadoop cluster that takes 3 hours to answer something will usually finish faster if somewhat oversubscribed than if undersubscribed. However, whenever latency and responsiveness actually matter, it's usually the high percentiles and max times that matter, and those are dramatically affected by oversubscription.


> - This idea is partially a distributed computing fallacy.  This GC doesn't scale to hundreds of cores... If you're on a 64 core machine splitting out your VMs so they are smaller, with the entire working set local to that CPU, and segmenting GC to that core, seems to make the most sense.  You would have GC pauses but they would be 1/Nth (where N = num of cores) of your entire application GC hit.
>
> - You can still use a CMS approach here where you GC in the background, it's just done on one core with another thread.


The problem with 1-core-per-JVM is not that it makes pauses take longer; it's that it creates pauses that would not ordinarily exist. As noted, I'm talking purely about cross-JVM disruption, and about disrupting threads due to scheduling contention on common cores.

> - GC isn't infinitely parallel... You aren't going to send part of your heap over the network and do a map/reduce style GC across 1024 servers within a cluster.  Data locality is important.  Keeping the JVMs small and local to the core and having lots of them seems to make a lot of sense.

Locality helps application threads, but GC doesn't really gain (or lose) anything from it. The bulk of the work for all GCs consists, by definition, of cache-missing operations. Pointer chasing (marking or single-pass copying) and streaming (sweeping or compacting evacuators) both have practically no locality benefits.


> - the fewer JVMs you have the more JDK lock contention you can have.  Things like SSL are still contended (yuk) ... though JDK 1.7 has definitely improved the situation.


Lock contention can certainly become an issue above some number of cores, but it's usually an application-logic thing and not a JVM thing. Still, it's a perfectly good reason to say you need extra JVMs to scale, especially if (as you note in your example) you don't control all the code and can't otherwise address the bottleneck. However, the more JVMs you run, the more jittery and less efficient things will be, and the more scheduling contention spikes will occur. Your best bet is to scale your JVM instance to where it naturally goes before either lock contention or GC pauses make you not want to go further, and then use that size of JVM to "laterally" scale within a box if that single JVM size can't fill it up. Keeping headroom (empty cores) to accommodate individual JVM work spikes (and overlapping spikes across JVMs) while doing so is important, and the more JVMs you have, the bigger those spikes can get.

I think you'll almost never find this all to mean 1 core per JVM; you'll usually stop at 2-3 cores per JVM (or higher).

> ... one issue of course is that OpenJDK doesn't share the permanent generation classes.  So you see like a 128MB hit per JVM.  This works out to about $2 per month per JVM for us so not really the end of the world.
>
> Kevin



awei...@voltdb.com
Sep 23, 2013, 10:58:21 AM
to mechanica...@googlegroups.com
What I have been thinking about is one JVM per socket or per group of cores as the # of cores per socket continues to increase. I would focus on keeping as much communication as possible local to the current NUMA node, or to smaller units of shared caches. If you are building a distributed system already it shouldn't be a huge deal to shard down one more level. The tooling for this is also convenient with numactl allowing you to bind execution and memory allocation to a numa node.
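A sketch of that per-socket binding with `numactl` (the node number and jar name are placeholders):

```shell
# Bind one shard's execution and memory allocation to NUMA node 0;
# a second shard would use --cpunodebind=1 --membind=1, and so on.
NODE=0
LAUNCH="numactl --cpunodebind=$NODE --membind=$NODE java -jar shard$NODE.jar"
echo "$LAUNCH"
```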

Solutions that work reasonably well across a variety of hardware are better for me. You might be able to do better, but if you run a JVM per socket (or group) you probably won't do worse than a single JVM per node, and that makes it easy to automate.

I did one benchmark this way and got linear scale-up, but the contention could just as easily have been in the application.

Ariel

Alexander Turner
Sep 24, 2013, 2:39:13 AM
to mechanica...@googlegroups.com
Out of curiosity, how can one have a lock which cannot be mitigated within a process but does not damage application logic when that logic is distributed across multiple processes? After all, processes are simply threads which have isolated address spaces. I expect you did not want someone to ask this - so I apologise. Nevertheless, it is the key question here because it goes to the nature of the problem you are solving rather than the symptom you are mitigating.

Gil Tene
Sep 24, 2013, 2:45:55 AM
to mechanica...@googlegroups.com
It's actually quite common, especially when the bottleneck is in some infrastructure piece of software you don't control. E.g. your logger (like log4j), your protocol processor, or something else in your container may have a process-local contention-based bottleneck (often a lock around a synchronous operation) which will prevent further scaling within the process without rewriting or rejiggering software you don't want to touch. But when your system is already built to run across multiple instances, this process-local bottleneck will often not affect cross-process scaling. That's a job for other, more interesting bottlenecks.