Tuning garbage collection for Storm

G Gordon Worley III

Jul 23, 2013, 1:51:45 PM
to storm...@googlegroups.com
I came across an interesting article describing how garbage collection works in Java:


I've been meaning to tackle this problem for some time, since we have occasionally seen workers that spend minutes doing nothing but garbage collection. Generally something else was wrong, but the point remains that it would likely be helpful to tune garbage collection on storm workers to optimize performance.

There is a post from December of 2011 on this list about the issue to which the response was "it depends". I agree that is the case but I was curious to see if anyone has looked into this issue and would be willing to share their settings (and the workload they're using with those settings).

Flip Kromer

Jul 24, 2013, 3:01:45 AM
to storm...@googlegroups.com

tl;dr: increase the heap size, but not by as much as you'd think; size the new-gen very aggressively; lightly tune the tenuring; put in the CMS old-gen collector. An example of production settings is at https://github.com/nathanmarz/storm/pull/632/files#L1R53

Here's the approach we use. It's based on practical validation, not lab-verified results, so your mileage will vary, and these settings in a different use case might be catastrophic. Give feedback if anything seems questionable or poorly explained.

==== GC Tuning (WIP adaptation from Big Data for Chimps)

Refs:

* [JVM Performance Tuning at Twitter](http://www.slideshare.net/aszegedi/everything-i-ever-learned-about-jvm-performance-tuning-twitter) -- if you read only one thing, read that one
* [Good Overview](http://www.cubrid.org/blog/dev-platform/understanding-java-garbage-collection/) of GC terms.
* [Technical Documentation](http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html)

______________

Tuning a dataflow system is easy:

* Ensure each stage is always ready to accept records, and
* Deliver each processed record promptly to its destination

That may seem simplistic, but it says that this is a pure-case generational GC problem: there are a few large blobs (cache map, etc), but mostly you have small tuples that live fast and die young. That's good.

Your tuning goal is that in steady-state the workers show

* No stop-the-world (STW) GCs, and nothing in the logs about aborted CMS collections
* old-gen GCs should not last longer than 1 second or happen more often than every 5 minutes
* new-gen GCs should not last longer than 50 ms or happen more often than every 5 seconds
* new-gen GCs should not fill the survivor space
* perm-gen occupancy should be constant

___________________

To start, have the JVM log GC activity at max verbosity -- see pull request #632 -- and get visualVM or YourKit hooked up to a worker running your flow in a production steady state.

-Xloggc:%STORMHOME%/logs/gc-worker-%ID%.log -verbose:gc
-XX:GCLogFileSize=1m -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10
-XX:+PrintGCDetails -XX:+PrintHeapAtGC -XX:+PrintGCTimeStamps -XX:+PrintClassHistogram
-XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime
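Once the logs are flowing, a quick sanity check is to scan them for new-gen pauses that violate the targets above. This is a rough sketch of my own, not anything from the Storm codebase; the ParNew log-line format it matches is an assumption based on typical `-XX:+PrintGCTimeStamps -XX:+PrintGCDetails` output, so adjust the regex to what your JVM actually emits:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcLogCheck {
    // Matches lines like "10.000: [GC 10.000: [ParNew: 100K->10K(200K), 0.0100000 secs]"
    // -- group 1 is the timestamp in seconds, group 2 the ParNew pause duration.
    private static final Pattern PAR_NEW =
        Pattern.compile("^(\\d+\\.\\d+): \\[GC.*ParNew.*, (\\d+\\.\\d+) secs\\]");

    /** Flags new-gen GCs longer than 50 ms or closer together than 5 s. */
    public static List<String> check(List<String> logLines) {
        List<String> warnings = new ArrayList<>();
        double lastTs = -1;
        for (String line : logLines) {
            Matcher m = PAR_NEW.matcher(line);
            if (!m.find()) continue;
            double ts = Double.parseDouble(m.group(1));
            double pause = Double.parseDouble(m.group(2));
            if (pause > 0.050)
                warnings.add("new-gen pause " + (pause * 1000) + " ms > 50 ms at " + ts + "s");
            if (lastTs >= 0 && ts - lastTs < 5.0)
                warnings.add("new-gen GCs only " + (ts - lastTs) + " s apart at " + ts + "s");
            lastTs = ts;
        }
        return warnings;
    }
}
```

An empty result from steady-state logs means the new-gen targets above are being met.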

First, roughly size the old-gen and perm-gen. Set the initial perm-gen space to say 150% of what you observe in steady state, and its max to twice what you observe. For us, this led to an initial perm-gen size of 96m (a bit generous) and a hard cap of 128m (perm-gen occupancy should not change much after startup, so I want the process to die hard if it does).
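Written out, the perm-gen rule is just two multiplications (hypothetical helper names; the 150%/200% factors are from the rule above):

```java
public class PermGenSizer {
    /** Initial perm-gen: ~150% of observed steady-state occupancy (MB). */
    public static long initialMb(long observedMb) {
        return observedMb * 3 / 2;
    }

    /** Hard cap: twice observed, so the process dies hard if perm-gen grows. */
    public static long maxMb(long observedMb) {
        return observedMb * 2;
    }
}
```

For example, an observed steady-state occupancy of 64MB gives the 96m initial / 128m cap used above.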

For many flows the biggest demand on old-gen space comes from long-lived state objects, like a persistentAggregate's CacheMap or a streaming join table. Force a GC (using eg visualVM) and look for how much old-gen is being used: let's call that `OG_used`. If the number is less than 2GB, use 2GB instead.
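If you'd rather read that number programmatically than eyeball VisualVM, the standard `java.lang.management` API exposes per-pool occupancy. A sketch, with the caveat that old-gen pool names vary by collector ("CMS Old Gen", "PS Old Gen", "G1 Old Gen", "Tenured Gen"), so the substring match here is an assumption:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class OldGenProbe {
    /** Bytes currently used in the old/tenured generation, or -1 if no such pool is found. */
    public static long oldGenUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName(); // collector-dependent, e.g. "CMS Old Gen"
            if (name.contains("Old Gen") || name.contains("Tenured")) {
                return pool.getUsage().getUsed();
            }
        }
        return -1;
    }
}
```

Call this (or check VisualVM) right after forcing a GC so the reading reflects live data rather than garbage awaiting collection.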

We want to start with a system that isn't having old-gen GC's, so size the heap generously: heap = (perm-gen size + 1500 MB new gen + OG_used). If that figure is more than 75% of the system ram, find a bigger machine to test on: we also don't want to think about swap or OS cache pressure.
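As arithmetic, the starting-heap rule looks like this (hypothetical helper names; the 2GB floor -- rounded to 2000MB here, matching the worked example below -- and the 75%-of-RAM check are straight from the text):

```java
public class HeapSizer {
    /** Starting heap (MB): perm-gen + oversized new-gen + max(OG_used, ~2 GB floor). */
    public static long startingHeapMb(long permGenMb, long newGenMb, long ogUsedMb) {
        return permGenMb + newGenMb + Math.max(ogUsedMb, 2000);
    }

    /** True if this heap would exceed 75% of system RAM -- find a bigger test box. */
    public static boolean needsBiggerMachine(long heapMb, long systemRamMb) {
        return heapMb > systemRamMb * 3 / 4;
    }
}
```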

In this example, let's say we figured OG_used at 2000MB and a perm-gen size of 96MB. We'll start with an initially oversized new-gen of 1600MB, giving a heap of (2000+1600+96) ≈ 3700MB, and the following starting set of values:

-Xmx3700m -Xms3700m -Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m
-XX:NewSize=1600m -XX:MaxNewSize=1600m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-server -XX:+AggressiveOpts -XX:+UseCompressedOops -Djava.net.preferIPv4Stack=true

Almost all the objects running through storm are short-lived -- that's what the First Rule of data stream tuning says -- so almost all your activity is in the new-gen. The new gen should gradually build to full, triggering a new-gen GC. Adjust the new-gen size so that they happen infrequently (5 or more seconds apart) and with minimal pause (<50ms).

When a new-gen GC occurs, objects copied into the survivor space should not fill it completely, and virtually no objects should survive to the second tenure. Adjust the new-gen size (larger means objects have more time to become unwanted) and the survivor ratio (a lower value makes a larger survivor space) until you see a large drop-off from eden to the first survivor space, and again from the first survivor to the second. At that point, turn the max tenuring threshold down to 1. (Since almost every object is handled within one new-gen GC period, only the fraction created just before a GC will hit the survivor space. So few things survive two passes that we should promote them out of the way immediately.)
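For reference while adjusting: `-XX:SurvivorRatio=N` makes eden N times the size of one survivor space, and new-gen = eden + two survivor spaces, so each survivor space is new-gen/(N+2). A quick hypothetical helper:

```java
public class SurvivorMath {
    /** Size in MB of one survivor space, given the new-gen size and -XX:SurvivorRatio. */
    public static long survivorSpaceMb(long newGenMb, int survivorRatio) {
        // new-gen = eden + 2 survivors, with eden = survivorRatio * survivor,
        // so survivor = newGen / (survivorRatio + 2)
        return newGenMb / (survivorRatio + 2);
    }
}
```

So the starting flags above (1600MB new-gen, SurvivorRatio=6) give 200MB survivor spaces; lowering the ratio to 4 would grow them to ~266MB each.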

Meanwhile, the old gen should average 50% utilization and increase only slowly (if at all). Your old-gen size should be the larger of a) 150-200% of its steady-state observed size, or b) slightly larger than the new-gen size. (It must be able to accommodate any object the new gen could hold). The old-gen collection should take less than 1000ms, and not occur more than once every five minutes. (It probably won't be anywhere near that often once the new gen is sized up.)

You can now try to reduce the memory footprint from this happy but overly generous state. Decrease the new-gen size until you notice either an increase in frequency or an increase in survivor occupancy. Once you do, back off by a good margin for your production value. Next reduce the overall heap size so that your old-gen occupancy has a nice sawtooth from say 40%-75%-40%-75%-... Make sure you don't reduce it too far -- the old-gen must be larger than the new-gen.

Whether we're tuning the overall flow for latency or throughput, we tune the GC for latency (low pause times). Since things like committing a batch can't proceed until the last element is received, local jitter induces global drag: it's a latency-throughput coupling. So we use the CMS low-pause collector for the old gen, and lower the initiating fraction (more frequent but shorter GCs).

Don't oversize the heap. The larger the heap, the longer a full GC takes. Even at heap sizes of 8GB, GC pauses on heavily-loaded machines can cause enough jitter to start slowing down transactional portions of the flow (another latency-throughput coupling). In the worst case, an overly-large heap can take so long to GC that it times out socket connections. That causes the cluster to rebalance, which drives up load, which causes heap pressure, which causes full GCs, and Christmas is ruined. I don't like running JVMs with more than 12GB of heap if I can avoid it. (Also know that above 32GB you lose compressed oops, so pointers double in size, taxing you a few GB.)
