Gauge metric type - StatsD integration

Andrzej Dębski

Sep 20, 2014, 5:02:45 PM
to kamon...@googlegroups.com
Hello

In our project we are using Kamon, and we want to gather some domain-specific metrics using Kamon's user metrics feature.

We are using a StatsD/Graphite/Grafana stack to visualize the data.

Today I was digging around the source code of Kamon and StatsD to see how the basics are implemented.

I can see that in the package kamon.metric.instrument you have 4 types of metrics:
  1. Counters
  2. Gauges
  3. Histograms
  4. MinMaxCounters
StatsD also supports 4 types:
  1. Counters
  2. Timers
  3. Gauges
  4. Sets
Now, in the class StatsDMetricsSender I can see:

  def writeMetricsToRemote(tick: TickMetricSnapshot, udpSender: ActorRef): Unit = {
    val packetBuilder = new MetricDataPacketBuilder(maxPacketSizeInBytes, udpSender, remote)

    for (
      (groupIdentity, groupSnapshot) ← tick.metrics;
      (metricIdentity, metricSnapshot) ← groupSnapshot.metrics
    ) {

      val key = metricKeyGenerator.generateKey(groupIdentity, metricIdentity)

      metricSnapshot match {
        case hs: Histogram.Snapshot ⇒
          hs.recordsIterator.foreach { record ⇒
            packetBuilder.appendMeasurement(key, encodeStatsDTimer(record.level, record.count))
          }

        case cs: Counter.Snapshot ⇒
          packetBuilder.appendMeasurement(key, encodeStatsDCounter(cs.count))
      }
    }

    packetBuilder.flush()
  }

  def encodeStatsDTimer(level: Long, count: Long): String = {
    val samplingRate: Double = 1D / count
    level.toString + "|ms" + (if (samplingRate != 1D) "|@" + samplingRateFormat.format(samplingRate) else "")
  }

  def encodeStatsDCounter(count: Long): String = count.toString + "|c"

So metrics of type Gauge, Histogram and MinMaxCounter are sent as StatsD timers, because their snapshots are all of type Histogram.Snapshot, while counters are sent as StatsD counters. I am wondering why the Gauge type is not sent as a StatsD gauge.
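
To make the timer encoding concrete, here is a small standalone sketch of how one histogram record (a value plus the number of times it was recorded) becomes one StatsD payload. The samplingRateFormat pattern is an assumption: the real one is defined elsewhere in StatsDMetricsSender and is not shown in the snippet above.

```scala
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

object TimerEncodingSketch {
  // Assumed pattern; Kamon's actual samplingRateFormat is not shown in the post.
  private val samplingRateFormat =
    new DecimalFormat("0.################", DecimalFormatSymbols.getInstance(Locale.US))

  // Same logic as encodeStatsDTimer above: one payload per recorded level, with
  // the count folded into a sampling rate of 1 / count.
  def encodeStatsDTimer(level: Long, count: Long): String = {
    val samplingRate: Double = 1D / count
    level.toString + "|ms" +
      (if (samplingRate != 1D) "|@" + samplingRateFormat.format(samplingRate) else "")
  }

  def encodeStatsDCounter(count: Long): String = count.toString + "|c"
}
```

With these definitions a value of 250 recorded four times is sent once with a sampling rate of 0.25, instead of four separate times.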

For example, when I want to monitor the number of actors of some type, I can either use the Counter metric type, in which case I have to keep updating the metric value all the time (because the counter is reset on each collection), or use the Gauge type and receive it on the StatsD end as a timer.

Is this intentional? From browsing the Kamon code, the fix would require a new snapshot type dedicated to gauges, so that the StatsD sender would know that the message has to be formatted as <metric-name>:<value>|g. This new snapshot type could simply return the newest value reported to the gauge, and that value would be sent to StatsD.
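
A rough sketch of that proposal, for illustration only: GaugeSnapshot, encodeStatsDGauge and statsDLine are hypothetical names and do not exist in Kamon.

```scala
object GaugeEncodingSketch {
  // Hypothetical snapshot type: carries only the newest value reported to the gauge.
  final case class GaugeSnapshot(latest: Long)

  // StatsD gauge payload: "<value>|g".
  // Note that StatsD treats a value with an explicit sign as a delta, not an absolute value.
  def encodeStatsDGauge(value: Long): String = value.toString + "|g"

  // Full line as it would appear on the wire: "<metric-name>:<value>|g".
  def statsDLine(key: String, snapshot: GaugeSnapshot): String =
    key + ":" + encodeStatsDGauge(snapshot.latest)
}
```

The sender's pattern match would then need an extra case for the new snapshot type, placed next to the existing Histogram.Snapshot and Counter.Snapshot cases.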

Andrzej Dębski

Sep 21, 2014, 5:49:15 AM
to kamon...@googlegroups.com
After some more thinking, I am wondering why Gauge is implemented the way it is (that is, backed by a Histogram). Wouldn't it be sufficient to implement it with a single private variable holding the value from the latest call to the record function?

Are there any advantages to using a Histogram as the "backend" for the gauge metric?

Ivan Topolnjak

Sep 22, 2014, 12:20:00 AM
to kamon...@googlegroups.com
Hello Andrzej, and welcome to our community!

It's nice that you ask why things are the way they are; that's something we should have written about a long time ago! Taking your question about the design of the snapshots:


Is this intentional? From browsing the Kamon code, the fix would require a new snapshot type dedicated to gauges, so that the StatsD sender would know that the message has to be formatted as <metric-name>:<value>|g. This new snapshot type could simply return the newest value reported to the gauge, and that value would be sent to StatsD.

We modeled the metric instruments included in Kamon around our own goals, and even though they look similar to the instruments with the same names in other metrics libraries, there are a few differences. Let me first explain what each instrument means to us:

The Counter is the simplest one: it just counts and resets to zero upon each flush. Some other libraries allow counters to go up and down, but we only allow them to go up. They are ideal for counting errors or occurrences of specific events in your app, but they fall short for things like mailbox sizes; see the MinMaxCounter section below to understand why.

Our Histogram is special and different from what all other libraries offer: thanks to the HDR Histogram [1] we can record all the measurements taken by the application in fixed time and space, in a very efficient manner and with configurable precision. The default precision is 1%, meaning that every measurement you store is adjusted to one of the available buckets, but is never more than 1% away from the original value; the precision can be configured to even finer values. As you might guess, this is ideal for storing latencies, like we do for processing-time and time-in-mailbox for actors, as well as elapsed-time for traces. It doesn't matter whether you store one million or one billion measurements, the memory footprint remains the same.

Now we get to one we created ourselves, the MinMaxCounter. When monitoring queues, like we do for actor mailboxes, just having a number going up and down (like a traditional counter) that is collected at some fixed interval, or recording the queue size at some fixed interval (like a traditional gauge), wasn't enough for us. If you are flushing every 10 seconds (the default for StatsD; some others flush every 60 seconds), many things might have happened in those 10 seconds: you could get a million messages in the mailbox, process them all, and after 10 seconds record 0 as the mailbox size. When monitoring a queue size it is not just about knowing where it is at a given moment, because that number is probably incorrect right after reading it; knowing where it has been is of great help. The MinMaxCounter internally has 3 variables: one tracking the current value, one tracking the minimum and one tracking the maximum. These three values are read and stored in a histogram every 100 milliseconds by default, so after 10 seconds you would have 300 measurements (100 readings of 3 values each) of where the queue size was, and they include the lowest and highest sizes! Knowing the boundaries between which your mailbox sizes usually move is incredibly valuable information if you are trying to move from unbounded to bounded mailbox implementations. After each collection, the max and min are reset to the current value, meaning that if no changes occur then the three of them will have the same value. Obviously this is the recommended instrument when recording queue-like metrics.
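
The bookkeeping described above can be sketched roughly as follows. This is a simplified, single-threaded illustration and not Kamon's actual implementation; the Reading type and the method names are assumptions made for the example.

```scala
object MinMaxCounterSketch {
  // The three values read on every refresh (hypothetical type).
  final case class Reading(min: Long, max: Long, current: Long)

  // Simplified stand-in: no thread safety and no backing histogram.
  final class MinMaxCounter {
    private var current = 0L
    private var min = 0L
    private var max = 0L

    def increment(times: Long = 1L): Unit = update(times)
    def decrement(times: Long = 1L): Unit = update(-times)

    private def update(delta: Long): Unit = {
      current += delta
      if (current < min) min = current
      if (current > max) max = current
    }

    // Called on every refresh: returns the three values, then resets
    // min and max to the current value.
    def collect(): Reading = {
      val reading = Reading(min, max, current)
      min = current
      max = current
      reading
    }
  }
}
```

In the mailbox example, a million increments followed by a million decrements between two refreshes would still show up as a maximum of one million, even though the current value is back at 0.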

Finally, our Gauge is a mix between the Histogram and the MinMaxCounter: we take a measurement of a given value every 100 milliseconds by default and store the observed value in a Histogram. This can be seen as a traditional gauge that happens to report 100 values on every flush (assuming the 10 seconds example and the default config) rather than a single (latest) value. That's why it uses a histogram snapshot: the number of measurements taken between flushes is usually larger than 1. If you flush every 10 seconds and configure the gauge to refresh every 10 seconds, you effectively turn our gauge into a traditional gauge. The uses for a gauge are diverse; we use them a lot in our system metrics module.
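
As a rough illustration of that idea, the sketch below samples a value on every refresh and hands out the collected distribution on every flush. SampledGauge is a hypothetical name, a plain map stands in for the HDR Histogram, and the scheduler that would call refresh() every refresh-interval is left out.

```scala
import scala.collection.mutable

object SampledGaugeSketch {
  // Hypothetical stand-in for a histogram-backed gauge: each refresh reads the
  // current value and counts how often each value was observed.
  final class SampledGauge(readValue: () => Long) {
    private val counts = mutable.Map.empty[Long, Long]

    // In a real setup a scheduler would call this every refresh-interval.
    def refresh(): Unit = {
      val value = readValue()
      counts.update(value, counts.getOrElse(value, 0L) + 1L)
    }

    // Called on every flush: returns the recorded (value -> count) pairs and starts over.
    def collect(): Map[Long, Long] = {
      val snapshot = counts.toMap
      counts.clear()
      snapshot
    }
  }
}
```

If refresh() is called exactly once per flush, the snapshot holds a single value, which matches the "traditional gauge" behaviour mentioned above.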

Now let's talk about StatsD. All of our instruments were designed without a specific metrics backend in mind, and we always prefer to capture as much information as possible and let the backend summarize the data if needed. As an example, our New Relic backend takes all the valuable data we have and reduces it to average, min, max, sum and sum of squares before posting to New Relic, because that's what they accept (sad but true). When trying to match our instruments with what StatsD offers, it was clear that our counters would be StatsD counters, and that Histogram and MinMaxCounter snapshots would be StatsD timers. We then tried to report our Gauge information as a StatsD gauge, but it turns out that StatsD only flushes the latest value of a gauge to the backend: in our 10 seconds example, only the last value sent would be reported and the previous 99 would be thrown away, which is not what we want for our users' valuable data. That's why for StatsD we report gauges as timers as well. As an additional note, I always try to use the word "distribution" in my mind rather than "timer", which in my opinion is a word that could have been better chosen.
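
The difference between the two mappings can be shown with a small comparison. This is only an illustration of the semantics described above, not StatsD or Kamon code; asGauge, asDistribution and Summary are made-up names.

```scala
object FlushComparisonSketch {
  // What a gauge keeps for one flush interval: only the last value sent.
  def asGauge(samples: Seq[Long]): Option[Long] = samples.lastOption

  // A few of the summary statistics that a distribution ("timer") can still provide.
  final case class Summary(count: Int, min: Long, max: Long, mean: Double)

  def asDistribution(samples: Seq[Long]): Option[Summary] =
    if (samples.isEmpty) None
    else Some(Summary(samples.size, samples.min, samples.max, samples.sum.toDouble / samples.size))
}
```

For 100 samples where the mailbox held a million messages 99 times and was empty on the last reading, the gauge view reports 0 while the distribution view still shows the maximum of one million.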

All that being said, I think you should now understand why we only have counter snapshots and histogram snapshots, and why they map the way they do to StatsD instruments. If you still have any doubts or suggestions, please let us know. I hope you find this information useful. Best regards!

Andrzej Dębski

Sep 22, 2014, 5:52:40 AM
to kamon...@googlegroups.com
Thank you very much for the lengthy response!

First, may I suggest copying it to the Kamon documentation, even in its current form? I think it is a great insight that may be valuable to newcomers.

I can now clearly see what you wanted to achieve with the Gauge implementation backed by HdrHistogram. I agree that for very "dynamic" values this is very valuable information, and that reporting something like that as a StatsD gauge would result in information loss.

My use case is a bit different. My idea was to leverage the existing infrastructure that Kamon already gives you by using UserMetrics to report some domain-specific metrics (as I stated in my first post). My metrics are much more "static": for now I want to know how many nodes there are in my graph database, and this value reaches a steady level after some time. For this, the last value would be sufficient. Some of those metrics (not all of them) would be pushed to StatsD, but all of them would be available through JMX (I have written a very simple Kamon JMX listener that subscribes to user metrics and creates a very simple MXBean with an array of metrics).

Currently I have registered Kamon gauges, and for reporting to JMX I calculate the average value of the histogram contained in the TickMetricSnapshot message. First tests showed that it looks OK: on the flush after the last write, the average is the correct value, because I only report a single value to the histogram.
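
For illustration, the averaging step might look roughly like this. Record is a hypothetical stand-in for the (level, count) pairs exposed by recordsIterator in the sender code earlier in the thread; it is not the real Kamon type.

```scala
object SnapshotAverageSketch {
  // Hypothetical stand-in for a histogram record: a value and how often it was recorded.
  final case class Record(level: Long, count: Long)

  // Count-weighted average of all records; None when nothing was recorded in the tick.
  def average(records: Iterator[Record]): Option[Double] = {
    val (sum, total) = records.foldLeft((0.0, 0L)) {
      case ((s, n), r) => (s + r.level.toDouble * r.count, n + r.count)
    }
    if (total == 0L) None else Some(sum / total)
  }
}
```

When only one value was recorded during the tick, the average is simply that value, up to the histogram's configured precision.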

After looking at the code of MinMaxCounter, using it would be perfect for me (it has the current value), but there is one problem. There are ways to know what type of metric a Histogram snapshot was created for (by looking at the metric/metric-group name), but when I receive a Histogram.Snapshot I do not know which value in the histogram was the latest. I was thinking of adding a richer interface to the snapshots created by the different metrics.

For example, MinMaxCounter would create a MinMaxSnapshot with methods for extracting the min, max and current values stored when the collect method was called. It would also retain the old Histogram.Snapshot interface (MinMaxSnapshot would extend Histogram.Snapshot), so old code would still work as designed, but anyone who wants to could extract the "characteristic" values.
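
A minimal sketch of what such a snapshot hierarchy could look like. HistogramSnapshot and Record are simplified stand-ins for Kamon's types, and MinMaxSnapshot is the hypothetical type proposed above; none of this is existing Kamon API.

```scala
object RichSnapshotSketch {
  final case class Record(level: Long, count: Long)

  // Simplified stand-in for Kamon's Histogram.Snapshot.
  trait HistogramSnapshot {
    def recordsIterator: Iterator[Record]
  }

  // Hypothetical richer snapshot: usable wherever a HistogramSnapshot is expected,
  // but it also exposes the values known at collection time.
  final case class MinMaxSnapshot(min: Long, max: Long, current: Long, records: Vector[Record])
      extends HistogramSnapshot {
    def recordsIterator: Iterator[Record] = records.iterator
  }

  // Existing consumers keep matching on the base type; new ones can ask for more.
  def describe(snapshot: HistogramSnapshot): String = snapshot match {
    case mm: MinMaxSnapshot => s"current=${mm.current} min=${mm.min} max=${mm.max}"
    case hs                 => s"records=${hs.recordsIterator.size}"
  }
}
```

A sender that only matches on the base snapshot type, like the StatsD one, would keep treating a MinMaxSnapshot as a plain histogram snapshot.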

Another idea is to create something like a SimpleGauge (or some other mechanism for marking whether a gauge should be dynamic, backed by HdrHistogram, or static, backed by a single value) that would only retain the current value, and whose snapshot would include only that value. The SimpleGauge would also not need to create a closure over a variable and resend it to the backing histogram, as is done currently.
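
Such a static gauge could be sketched along these lines. SimpleGauge is only the name proposed above and does not exist in Kamon; the use of AtomicLong is an assumption made so that the sketch is safe to call from several threads.

```scala
import java.util.concurrent.atomic.AtomicLong

object SimpleGaugeSketch {
  // Hypothetical "static" gauge: keeps only the most recently recorded value.
  final class SimpleGauge(initial: Long = 0L) {
    private val latest = new AtomicLong(initial)

    def record(value: Long): Unit = latest.set(value)

    // Unlike a counter, collecting does not reset the value.
    def collect(): Long = latest.get()
  }
}
```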

Ivan Topolnjak

Sep 25, 2014, 9:49:52 AM
to kamon...@googlegroups.com
Andrzej,

A couple of comments:


Currently I have registered Kamon gauges, and for reporting to JMX I calculate the average value of the histogram contained in the TickMetricSnapshot message. First tests showed that it looks OK: on the flush after the last write, the average is the correct value, because I only report a single value to the histogram.

That's nice! Another user mentioned that JMX support would be a nice thing to have. If you feel like sharing this with the community, that would be awesome! :)

After looking at the code of MinMaxCounter, using it would be perfect for me (it has the current value), but there is one problem. There are ways to know what type of metric a Histogram snapshot was created for (by looking at the metric/metric-group name), but when I receive a Histogram.Snapshot I do not know which value in the histogram was the latest. I was thinking of adding a richer interface to the snapshots created by the different metrics.

There will be a richer API for working with Histogram.Snapshot, keep an eye on [1]. With regard to knowing the "last" value of the gauge: there is no need to know the last value if you only have one, right? I would recommend configuring your gauge so that the refresh-interval matches the tick-interval. You might keep your code as it is, calculating the average just in case, but typically there will be just a single value to work with. In code, it might look like:

somewhere in your application.conf:

my-gauge-config {
  highest-trackable-value = 999999999
  significant-value-digits = 2
  refresh-interval = ${kamon.metrics.tick-interval}
}

and then in your code:

val gauge = Gauge.fromConfig(yourGlobalConfig.getConfig("my-gauge-config"), system) { ..... }

I hope that it helps you, regards!

Andrzej Dębski

Oct 5, 2014, 11:46:38 AM
to kamon...@googlegroups.com
Hey

That's nice! Another user mentioned that JMX support would be a nice thing to have. If you feel like sharing this with the community, that would be awesome! :)

I was thinking about it, but before I do I need to change it a bit: add configuration, and subscribe to more than just UserMetrics.

There will be a richer API for working with Histogram.Snapshot, keep an eye on [1]. With regard to knowing the "last" value of the gauge: there is no need to know the last value if you only have one, right? I would recommend configuring your gauge so that the refresh-interval matches the tick-interval. You might keep your code as it is, calculating the average just in case, but typically there will be just a single value to work with. In code, it might look like:

Good idea - thanks.