Here's the approach we use. It's based on practical validation, not lab-verified results, so your mileage will vary, and these settings could be catastrophic in a different use case. Give feedback if anything seems questionable or poorly explained.
==== GC Tuning (WIP adaptation from Big Data for Chimps)
Refs:
* [JVM Performance Tuning at Twitter](http://www.slideshare.net/aszegedi/everything-i-ever-learned-about-jvm-performance-tuning-twitter) -- if you read only one thing, read that
* [Good Overview](http://www.cubrid.org/blog/dev-platform/understanding-java-garbage-collection/) of GC terms.
* [Technical Documentation](http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html)
______________
Tuning a dataflow system is easy:
* Ensure each stage is always ready to accept records, and
* Deliver each processed record promptly to its destination
That may seem simplistic, but it says that this is a pure-case generational GC problem: there are a few large blobs (cache map, etc), but mostly you have small tuples that live fast and die young. That's good.
Your tuning goal is that in steady-state the workers show
* No stop-the-world (STW) GCs, and nothing in the logs about aborted CMS collections
* old-gen GCs should not last longer than 1 second or happen more often than every 5 minutes
* new-gen GCs should not last longer than 50 ms or happen more often than every 5 seconds
* new-gen GCs should not fill the survivor space
* perm-gen occupancy should be constant
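An aborted CMS shows up in the GC log as a `concurrent mode failure` (CMS couldn't finish before the old gen filled, forcing a stop-the-world full collection), so the first goal is easy to check mechanically. A sketch, run against a fabricated two-line log excerpt (your real log lines will carry more fields):

```shell
# Fabricated GC log excerpt for illustration -- not real worker output
cat > /tmp/gc-sample.log <<'EOF'
102.5: [GC 102.5: [ParNew: 1228800K->40960K(1433600K), 0.0312000 secs]]
311.7: [Full GC 311.7: [CMS (concurrent mode failure): 2048000K->2047999K(2097152K), 8.1234567 secs]]
EOF

# Count lines indicating trouble: full GCs or aborted CMS collections
bad=$(grep -c -e 'Full GC' -e 'concurrent mode failure' /tmp/gc-sample.log)
echo "problem events: $bad"    # in steady state this should be 0
```

Running that periodically (or wiring it into your monitoring) gives an early warning well before the cluster starts timing out.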
___________________
To start, have the JVM log GC activity at max verbosity -- see pull request #632 -- and get VisualVM or YourKit hooked up to a worker running your flow in a production steady state.
-Xloggc:%STORMHOME%/logs/gc-worker-%ID%.log -verbose:gc
-XX:GCLogFileSize=1m -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
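With those flags in place, new-gen pause times can be pulled straight out of the log. A sketch with awk, run against a fabricated two-line excerpt; the field position assumes the `PrintGCDetails`/`PrintGCTimeStamps` layout shown here, so adjust it for your JVM's exact output:

```shell
# Fabricated ParNew lines in the -XX:+PrintGCDetails -XX:+PrintGCTimeStamps format
cat > /tmp/gc-worker.log <<'EOF'
12.345: [GC 12.345: [ParNew: 1228800K->40960K(1433600K), 0.0312000 secs] 1269760K->81920K(3788800K), 0.0313000 secs]
19.001: [GC 19.001: [ParNew: 1269760K->38120K(1433600K), 0.0287000 secs] 1310720K->79080K(3788800K), 0.0288000 secs]
EOF

# Field 6 is the ParNew pause in seconds; report each in ms against the 50 ms budget
pauses=$(awk '/ParNew/ { printf "%.1f\n", $6 * 1000 }' /tmp/gc-worker.log)
echo "$pauses"
```

Eyeballing the timestamps in field 1 gives you the GC frequency as well, which is the other half of the tuning target below.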
First, make rough estimates of the old-gen and perm-gen sizes. Set the initial perm-gen space to, say, 150% of what you see, and its max to twice what you see. For us, this led to an initial perm-gen size of 96MB (a bit generous) and a hard cap of 128MB (perm-gen occupancy should not change much after startup, so I want the process to die hard if it does).
For many flows the biggest demand on old-gen space comes from long-lived state objects, like a persistentAggregate's CacheMap or a streaming join table. Force a GC (using e.g. VisualVM) and look at how much old-gen space remains in use: let's call that `OG_used`. If the number is less than 2GB, use 2GB instead.
We want to start with a system that isn't having old-gen GC's, so size the heap generously: heap = (perm-gen size + 1500 MB new gen + OG_used). If that figure is more than 75% of the system ram, find a bigger machine to test on: we also don't want to think about swap or OS cache pressure.
In this example, let's say we put OG_used at 2000MB and the perm-gen size at 96MB. We'll start with an initially oversized new gen of 1600MB, giving a heap of (2000 + 1600 + 96) ≈ 3700MB, and the following starting set of values:
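The arithmetic, as a shell sketch (the figures are this example's assumptions, not recommendations):

```shell
og_used=2000    # old-gen occupancy observed after a forced GC, MB (2GB floor)
new_gen=1600    # deliberately oversized new gen, MB
perm_gen=96     # initial perm-gen size, MB
heap=$(( og_used + new_gen + perm_gen ))
echo "-Xms${heap}m -Xmx${heap}m"    # prints -Xms3696m -Xmx3696m; rounded to 3700m in the flags
```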
-Xmx3700m -Xms3700m -Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m
-XX:NewSize=1600m -XX:MaxNewSize=1600m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-server -XX:+AggressiveOpts -XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true
Almost all the objects running through Storm are short-lived -- that's what the First Rule of data stream tuning says -- so almost all your activity is in the new gen. The new gen should gradually build to full, triggering a new-gen GC. Adjust the new-gen size so that these GCs happen infrequently (5 or more seconds apart) and with minimal pause (under 50 ms).
When a new-gen GC occurs, objects copied into the survivor space should not fill it completely, and virtually no objects should survive to the second tenure. Adjust the new-gen size (larger means objects have more time to become garbage) and the survivor ratio (a lower value makes a larger survivor space) until you see a large drop-off from eden to the first survivor tenure, and again from the first to the second. At that point, turn the max tenuring threshold back to 1. (Since almost every object is handled within one new-gen GC period, only the fraction created just before a GC will hit the survivor space. So few things survive two passes that we should promote the survivors out of the way immediately.)
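For reference, `SurvivorRatio=N` means eden is N times the size of one survivor space, so each survivor space is new-gen / (N + 2). A sketch with this example's 1600MB new gen and `SurvivorRatio=6`:

```shell
new_gen=1600; ratio=6
survivor=$(( new_gen / (ratio + 2) ))   # one survivor space, MB
eden=$(( new_gen - 2 * survivor ))      # the remainder is eden
echo "eden=${eden}MB, each survivor=${survivor}MB"
```

So lowering the ratio grows the survivor spaces at eden's expense, which is the knob the paragraph above is turning.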
Meanwhile, the old gen should average 50% utilization and increase only slowly (if at all). Your old-gen size should be the larger of a) 150-200% of its steady-state observed size, or b) slightly larger than the new-gen size. (It must be able to accommodate any object the new gen could hold). The old-gen collection should take less than 1000ms, and not occur more than once every five minutes. (It probably won't be anywhere near that often once the new gen is sized up.)
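The old-gen sizing rule above is a max of two bounds; sketched in shell with the running example's figures (the 100MB slack is an arbitrary stand-in for "slightly larger"):

```shell
og_steady=2000   # steady-state old-gen occupancy, MB
new_gen=1600     # new-gen size, MB
a=$(( og_steady * 3 / 2 ))        # bound (a): 150% of steady-state occupancy
b=$(( new_gen + 100 ))            # bound (b): slightly larger than the new gen
old_gen=$(( a > b ? a : b ))      # take whichever is larger
echo "old-gen target: ${old_gen}MB"
```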
You can now try to reduce the memory footprint from this happy but overly generous state. Decrease the new-gen size until you notice either an increase in frequency or an increase in survivor occupancy. Once you do, back off by a good margin for your production value. Next reduce the overall heap size so that your old-gen occupancy has a nice sawtooth from say 40%-75%-40%-75%-... Make sure you don't reduce it too far -- the old-gen must be larger than the new-gen.
Whether we're tuning the overall flow for latency or throughput, we tune the GC for latency (low pause times). Since things like committing a batch can't proceed until the last element is received, local jitter induces global drag: it's a latency-throughput coupling. So we use the CMS low-pause collector for the old gen, and lower the initiating fraction (more frequent but shorter GCs).
Don't oversize the heap. The larger the heap, the longer a full GC takes. Even at heap sizes of 8GB, GC pauses on heavily-loaded machines can cause enough jitter to start slowing down transactional portions of the flow (another latency-throughput coupling). In the worst case, an overly-large heap can take so long to GC that it times out socket connections. That causes the cluster to rebalance, which drives up load, which causes heap pressure, which causes full GCs, and Christmas is ruined. I don't like running JVMs with more than a 12GB heap if I can avoid it. (Also know that above a 32GB heap compressed oops no longer apply, so your pointer size doubles, taxing you a few GB of effective capacity.)