Large heap usage


jbezza

Mar 25, 2015, 6:46:32 PM
to metric...@googlegroups.com
Hello,

I am seeing large heap usage coming from codahale metrics while debugging an issue in Apache Cassandra (Cassandra uses metrics-core-2.2.0).

After taking a heap dump, I see around 1GB of memory (out of a 4GB heap) occupied by java.util.concurrent.ConcurrentSkipListMap$Node objects.  From what I can see in the code, this comes from the ExponentiallyDecayingSample object, which is used as part of the Histogram.  Cassandra uses Histogram objects a lot, but 1GB seems excessive.
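For context on why the dump is dominated by ConcurrentSkipListMap$Node objects: the decaying sample keeps its values in a skip list keyed by a priority. The following is a simplified, self-contained sketch of that storage pattern, not the actual metrics-core code (the real implementation weights each priority by an exponential decay factor and rescales periodically):

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ThreadLocalRandom;

// Simplified sketch of the storage inside an exponentially decaying
// sample. A plain random key stands in for the real decay-weighted
// priority; it is enough to show where the ConcurrentSkipListMap$Node
// objects in the heap dump come from.
public class DecayingSampleSketch {
    private final int reservoirSize;
    private final ConcurrentSkipListMap<Double, Long> values =
            new ConcurrentSkipListMap<>();

    public DecayingSampleSketch(int reservoirSize) {
        this.reservoirSize = reservoirSize;
    }

    public void update(long value) {
        // Each put allocates a skip-list node (plus boxed key/value),
        // which is what shows up as ConcurrentSkipListMap$Node in a dump.
        values.put(ThreadLocalRandom.current().nextDouble(), value);
        if (values.size() > reservoirSize) {
            values.pollFirstEntry(); // evict the lowest-priority sample
        }
    }

    public int size() {
        return values.size();
    }

    public static void main(String[] args) {
        DecayingSampleSketch sample = new DecayingSampleSketch(1028);
        for (long i = 0; i < 100_000; i++) {
            sample.update(i);
        }
        // The map is bounded by the reservoir size, not the update count.
        System.out.println(sample.size());
    }
}
```

Since the map itself stays bounded, steady growth toward 1GB would point at many reservoirs, unusually large reservoir sizes, or allocation churn elsewhere, rather than the resting size of a single sample.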

I discovered this while debugging an issue where I see large-ish GC pauses when my JMX agent is bounced.  The JMX agent connects to Cassandra and scrapes the codahale metrics for analytics; it should have zero impact on the performance of Cassandra when bounced.

Is the ExponentiallyDecayingSample object expected to fill up over time?  Is there a potential memory leak where long-running JMX connections hold onto these objects and only release them for garbage collection when the connection terminates?  Or perhaps this is a Cassandra bug?

Any help greatly appreciated.
Jim.


 


Marshall Pierce

Mar 25, 2015, 11:55:13 PM
to metric...@googlegroups.com
A few observations:

- That’s quite an old version of metrics. Cassandra should upgrade.

- The snapshot API used between reservoirs and reporters is unfortunately very allocation-heavy (that’s one of the things I’ve been thinking about addressing in metrics 4), but even so, 1GiB seems excessive, so something is probably not right.

- Is Cassandra configuring its exponential decay reservoirs with large sizes? The default (at least in metrics 3) is 1028, which is not crazy big. How many reservoirs (vs samples) exist in your heap dump?

- ExponentiallyDecayingReservoir (as it’s called in metrics 3, at least) is both incorrect for non-normal distributions, which include most things you’d probably want to measure with metrics, and allocation-heavy even for a reservoir. In a world where Cassandra uses metrics 3, use https://bitbucket.org/marshallpierce/hdrhistogram-metrics-reservoir instead, assuming you’re measuring things like latency. See that URL for more on why.
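To put rough numbers on the reservoir-count question above: assuming default-sized reservoirs (1028 samples) and a guessed ~80 bytes per skip-list entry (node plus boxed Double key and boxed value; not a measured figure), the footprint can be estimated like this:

```java
// Back-of-envelope footprint estimate for decaying-sample reservoirs.
// The 80-bytes-per-entry figure is a guess, not a measured number.
public class ReservoirFootprintEstimate {
    public static long roughBytes(long reservoirs, int samplesPerReservoir,
                                  int bytesPerEntry) {
        return reservoirs * samplesPerReservoir * (long) bytesPerEntry;
    }

    public static void main(String[] args) {
        // 1,000 default-sized (1028-sample) reservoirs at ~80 B/entry:
        System.out.println(roughBytes(1_000, 1028, 80)); // 82240000, ~82 MB
    }
}
```

On that estimate you would need on the order of ten thousand default-sized reservoirs before skip-list entries alone accounted for 1GB, which is why the reservoir count and configured sizes in the heap dump are worth checking.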

-Marshall

jbezza

Mar 26, 2015, 11:38:39 AM
to metric...@googlegroups.com
Hi Marshall,

Thanks for the quick response and interesting observations.  Yes, debugging further, it seems that over a period of a few days Cassandra creates a whole load of tenured garbage, which isn't freed until concurrent mark-sweep kicks in when the heap runs low on space.  1GB was the worst I have seen; the other hosts were in the ~100 MB range.
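In case it helps anyone confirming the same pattern: GC logging along these lines makes it visible when CMS actually reclaims that tenured garbage (flag names are for JDK 7/8 HotSpot; the log path is just an example):

```
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintTenuringDistribution
-Xloggc:/var/log/cassandra/gc.log
```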

I'll take up your points with the Cassandra community to see if we can improve things.

Thanks,
Jim.