Metrics from Consul

Darron Froese

Nov 28, 2014, 11:41:37 PM

to consu...@googlegroups.com

We're testing a small Consul cluster (~100 nodes - 3 servers) before we roll it into production and are looking at the stats that are being sent using the statsd integration.

We're trying to make sure we understand what the metrics mean - so that as we experiment we have a better chance at understanding what's actually going on.

We've been able to view leadership transitions (through consul.serf.events.consul_new_leader), watch raft and serf messages be propagated and see the cluster state change as it's been rebooted.

I have looked through some of the Go code around the metrics and have some questions - but are there any documents already written that describe the important metrics to watch?

The code is well documented and seems easy to understand (even though I don't speak much Go at the moment) - but I was wondering:

1. What metrics do you at HashiCorp keep a close eye on / alert on to make sure things are running smoothly?

2. It looks like your MeasureSince method is in msecs - am I reading that right?

3. Is there a count anywhere of how many nodes a particular server thinks exist at the moment? I haven't seen one in the source or the metrics that have been emitted.

4. Do we know when everybody *has* committed a particular change? Is there a latency value around the KV data or other changes?

I know that monitoring a distributed system is different than monitoring something much more static - so please forgive me if I'm not asking the right questions or barking up the wrong tree.

Thanks for a great piece of software, we're looking forward to rolling this out and seeing it mature.

Armon Dadgar

Nov 29, 2014, 3:29:08 PM

to consu...@googlegroups.com, Darron Froese

Hey Darron,

Many of the metrics aren’t immediately useful, and are mostly there to assist with debugging.

The key ones I would watch:

* consul.raft.commitTime - The number of write transactions and associated latency

* consul.serf.queue.Event - The backlog of events in the queue, useful for catching a bad client that is flooding the cluster

* The ACL metrics (cache_hits, cache_miss, fault, resolveToken) are useful if ACLs are enabled

1) Ultimately, however, we do much higher-level monitoring of health. We treat Consul as a black box.

Just attempting a periodic read/write of a key is enough for our monitoring of the cluster.

2) Yes, MeasureSince is in milliseconds.

3) No, there isn’t a metric for this. You can always query the nodes to ask them, but it’s not generally an actionable metric.

4) Not really, since a commit only requires that a quorum of nodes agree. It could take indefinitely long for all the peers to commit a change (for example, if 1 of 3 servers has failed).
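To make the quorum arithmetic in point 4 concrete, here is a minimal sketch. The helper names quorumSize and canCommit are made up for illustration and are not part of Consul; the sketch only assumes the usual majority rule of n/2 + 1 voters.

```go
package main

import "fmt"

// quorumSize returns the smallest majority of n voting servers.
// Hypothetical helper for illustration, not a Consul API.
func quorumSize(n int) int {
	return n/2 + 1
}

// canCommit reports whether a write can still reach a quorum
// when `failed` of the n servers are down.
func canCommit(n, failed int) bool {
	return n-failed >= quorumSize(n)
}

func main() {
	for _, n := range []int{3, 5} {
		q := quorumSize(n)
		fmt.Printf("servers=%d quorum=%d tolerated_failures=%d\n", n, q, n-q)
	}
	// With 3 servers and 1 down, writes still commit on the other 2,
	// but the failed peer never applies them until it comes back.
	fmt.Println(canCommit(3, 1))
	fmt.Println(canCommit(3, 2))
}
```

So a commit on a 3-server cluster says nothing about the third server, which is why there is no "everybody has committed" signal.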

Hope that helps!

Best Regards,

Armon Dadgar


Darron Froese

Dec 1, 2014, 1:07:04 AM

to Armon Dadgar, consu...@googlegroups.com

Cool - thanks for the feedback Armon - we'll keep an eye on the metrics you indicated as we roll this out.

Thanks again!
