Consul and Sensu

Luke Chavers

Sep 16, 2014, 7:33:45 PM

to consu...@googlegroups.com

Based on what I've read here: http://www.consul.io/intro/vs/nagios-sensu.html -- it seems that Consul's "checks" are a logical replacement for Sensu's.

However, we also use Sensu to collect metrics, which are piped to Graphite.

Does Consul address this need in any way, or, what do others typically do to address this?

Thanks,

Luke

Gavin M. Roy

Sep 16, 2014, 9:32:53 PM

to Luke Chavers, consu...@googlegroups.com

Would be really cool if:

1) the check results are evented in 0.4 (I’ve not checked into this yet)

2) the check results could be published to the Sensu RabbitMQ server and behave like normal Sensu checks.

If the check results are evented, it should be pretty trivial to create a bridge to publish the results to Sensu via AMQP.
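A minimal sketch of the translation step such a bridge would need, assuming Consul check entries carry `Node`, `CheckID`, `Status`, and `Output` fields and that Sensu accepts a result shaped like `{"client": ..., "check": {...}}`. The payload shape and the function names are illustrative assumptions, not a verified Sensu schema, and the actual AMQP publish (which needs a client library) is left out.

```python
import json
import time

# Consul reports check state as a string; Sensu uses Nagios-style exit codes.
CONSUL_TO_SENSU_STATUS = {"passing": 0, "warning": 1, "critical": 2}


def to_sensu_result(consul_check, now=None):
    """Translate one Consul check entry into a Sensu-style result dict.

    The output layout is an assumption made for illustration.
    """
    issued = int(now if now is not None else time.time())
    return {
        "client": consul_check["Node"],
        "check": {
            "name": consul_check["CheckID"],
            "output": consul_check.get("Output", ""),
            # Any unrecognised state maps to 3 (UNKNOWN in the Nagios convention).
            "status": CONSUL_TO_SENSU_STATUS.get(consul_check["Status"], 3),
            "issued": issued,
        },
    }


def encode_for_amqp(result):
    """Serialise the result as the JSON body a bridge would publish."""
    return json.dumps(result, sort_keys=True).encode("utf-8")
```

The bytes returned by `encode_for_amqp` would then be handed to whatever AMQP client publishes to the Sensu RabbitMQ server.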

Cheers,

Gavin


Luke Chavers

Sep 16, 2014, 9:45:20 PM

to Gavin M. Roy, consu...@googlegroups.com

Yes, having a way to send the check results to Consul and Sensu would be pretty cool, but I was referring more to Sensu metrics (http://sensuapp.org/docs/0.11/adding_a_metric).

I have no love for Ruby, so I'd certainly be open to Sensu replacements based in Go, Node.js, or simply Bash as well.

Thanks,
Luke

Armon Dadgar

Sep 17, 2014, 12:25:14 PM

to Luke Chavers, Gavin M. Roy, consu...@googlegroups.com

I took a quick look at Sensu metrics, so forgive any misunderstandings. If I understand correctly, it is used to pull structured metrics and push them down the normal AMQP path, and handlers are then free to push to Graphite, etc.

This model is not particularly well suited to Consul. With our health checks, we do a lot to ensure they are scalable (read: avoiding writes to the servers). With metrics, there is no way to optimize away the write and pushing ephemeral data through consensus is a bad idea, just due to the scalability limits and how much data may be pushed.

Let me recommend instead what we do: use Consul for its health checking, as it is built in and very scalable by design, much more so than Nagios or Sensu. Then, use Consul as the discovery component to stream metrics to an aggregation site (e.g. statsite). This is what we do internally at HashiCorp, and it works very well for us.

Another advantage is that service discovery and health checking are decoupled from the metrics system, so any overloading or failure there will not boil over into the other systems.

Best Regards,

Armon Dadgar

Luke Chavers

Sep 17, 2014, 3:52:23 PM

to Armon Dadgar, Gavin M. Roy, consu...@googlegroups.com

Armon,

Thank you very much for the thoughtful reply. I am fast becoming a HashiCorp fan-boy.. seriously..

What you said is about what I figured, but I wanted to ask to see if anyone had solved the problem in an overly clever way, or in case I had missed some fundamental point in the Consul docs.

I think for the moment, then, I should probably continue with Sensu and Consul and see how that works out. There's nothing overly special about Sensu and how it gathers its metrics and check data; it's merely a collection of shell and Ruby CLI scripts. However, Sensu has a rather large collection of community-contributed checks and metrics, and I would not want to reproduce them.

I only worry about the overhead hell that our VMs are gradually drifting toward. After adding Consul for checks and discovery, we'll also have Sensu for metrics (Ruby), Puppet for config management (also Ruby), Logstash for logs (Java), Log.io for quick audits (Node), and Atlasboard for maintenance reports (also Node). Not to mention the services that do not cause additional overhead, like RunDeck for orchestration, Terraform for provisioning, the Foreman for various maintenance, Packer for pre-building images, Strider for C.I., and GitLab for the repo management (+Vagrant, Task Manager, Support System, VirtualBox, VMware, AWS). Having a background in systems development, I think about creating a universal VM agent just about every day, but there is no time.

My co-workers are constantly raising eyebrows at me, ha ha, for good reason. Any advice in this direction is welcomed, for sure.

Anyway, Consul looks great, and I cannot wait to start tampering with it. Keep up the good work.

-Luke

Tuukka Mustonen

Sep 18, 2014, 2:43:32 AM

to consu...@googlegroups.com

Luke, I'm in the same boat, with the exact same worries. But you forgot InfluxDB + Grafana to actually visualize the metrics and Elasticsearch + Kibana to visualize the logs :)

Considering how many moving pieces a Sensu setup has, it would feel rather awkward to use Sensu just for collecting metrics. I tried pushing collectd metrics to InfluxDB (through the Graphite/Carbon compatibility interface) and it worked just OK. I don't have real experience with collectd, but I guess it is pretty configurable, having been around for so long. But it also feels ancient, not modern. So, I find myself wondering: what is the "modern way of collecting (and visualizing) metrics"?

Armon, what do you mean by "use Consul as the discovery component to stream metrics to an aggregation site (e.g. statsite)" ? Do you actually collect the metrics (CPU, memory, disk utilization, etc.) through Consul and push them to statsite/statsd?

Also, when pushing the metrics to statsd/statsite, does Consul send them directly from each node or through master(s)? If each node pushed metrics directly to its sink, wouldn't that circumvent the issue of "With metrics, there is no way to optimize away the write and pushing ephemeral data through consensus is a bad idea, just due to the scalability limits and how much data may be pushed."?

The documentation doesn't seem to give any hints on how to customize the telemetry collection?

Tuukka

(New to both Sensu and Consul so bear with me)

Armon Dadgar

Sep 18, 2014, 12:56:20 PM

to Tuukka Mustonen, consu...@googlegroups.com

There has been an explosion in the number of tools and agents we have now, but on the other hand, remember what life was like without all of them :)

Luke, to clarify what I mean, we just do a standard Consul deploy, using it as the authoritative DNS server for the “.consul” TLD. We deploy our Graphite and statsite services and expose them through Consul, with appropriate health checks.

Then we simply configure all our internal services to stream to “statsite.service.consul:8125”. In this sense, we are letting Consul manage the discovery + health checking over DNS, but then all the individual services and agents are able to stream directly to the statsite service without communicating via Consul (i.e. Consul is not responsible for transporting the data in the same way Sensu is).
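A rough sketch of what "streaming to statsite.service.consul:8125" can look like from an application, assuming the host's resolver forwards the “.consul” domain to Consul and that the aggregator speaks the usual statsd line format (`name:value|type`) over UDP. The function names and the metric name are made up for illustration.

```python
import socket


def format_statsd(name, value, metric_type="c"):
    """Build one statsd line, e.g. a counter ("c"), gauge ("g") or timer ("ms")."""
    return "{}:{}|{}".format(name, value, metric_type)


def send_metric(line, host="statsite.service.consul", port=8125):
    """Send one statsd line over UDP.

    The host name is resolved when the datagram is sent, so Consul DNS
    decides which healthy statsite instance receives it; Consul itself
    never carries the metric.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
```

An application would call something like `send_metric(format_statsd("api.requests", 1))`; only the name lookup touches Consul.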

The other metrics are collected via collectd and pushed to the “carbon.service.consul”. So same deal there, we decouple the discovery from the metrics shipping itself.

Lastly, if you provide the `statsite_addr` or `statsd_addr` config to Consul, it does send directly from each node to that address. I think I see the confusion here though. Consul can send metrics that it generates internally to statsd/statsite, but it doesn’t allow other applications to write their own metrics to Consul. That said, it certainly is an interesting idea to support that, since Consul already speaks the protocol.
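As an illustration only, the snippet below renders an agent config fragment using the two option names mentioned above. It assumes they are top-level JSON keys; where these keys live may differ between Consul versions, so check the agent documentation for the release in use. The address is a placeholder.

```python
import json


def agent_metrics_config(statsite_addr=None, statsd_addr=None):
    """Return a JSON config fragment containing whichever sink address is set."""
    config = {}
    if statsite_addr:
        config["statsite_addr"] = statsite_addr
    if statsd_addr:
        config["statsd_addr"] = statsd_addr
    return json.dumps(config, indent=2, sort_keys=True)


# Placeholder address; each agent would send its own internal metrics there.
print(agent_metrics_config(statsite_addr="10.0.0.5:8125"))
```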

I hope that clarifies things! If not, I’m happy to answer any questions.

Best Regards,

Armon Dadgar

Luke Chavers

Sep 22, 2014, 7:03:10 PM

to consu...@googlegroups.com

Armon,

I think I understand where you're coming from, and I am pretty sure that's in line with where I started, which is basically orienting Consul as an assistant to Puppet for service discovery, then a replacement or augmentation for various availability tools like Keepalived.

Also, I'm with ya on your statement about agents. I love each of the tools I've listed, and can certainly remember the days before them. I just worry about the amount of overhead I'm adding when every concept requires its own daemon. Many of the smart people around me are concerned about that even more, to the point that I always have a lot of explaining to do when I want to add the 5th or 6th agent to every VM in our inventory.

I suppose that I could unify all of the metrics, checks, and logs by directing all output into the logging stack. E.g. I could redirect, or duplicate, Sensu's STDOUT into a Unix socket, and have the local Logstash agent read from it. Same for Consul. Then, at my main log processing server/cluster, I could pull the metric and check data back out using the Logstash output plugins for statsd, RabbitMQ, etc. None of this addresses the agent overload issue, but at least it makes things uniform, but idk.

Another member of this group emailed me privately, and put forth his own strategy of exposing metrics, checks, and logs using SNMP.. thus eliminating the need for most (if not all) of the agents. Seems like a very cool idea.

Anyway, I appreciate the replies, I certainly have plenty of info to consider.

Thanks,

Luke

sylvain boily

Oct 1, 2014, 2:53:07 PM

to consu...@googlegroups.com

Hello, I'm new to the Consul world and was looking at the comparison with Sensu.

There are many aspects of Consul that we like when comparing it to Sensu, but there is one part of Sensu besides the metrics that I was looking for, and that is the handlers.

We are interested in running different handlers besides re-routing traffic (the integration with service discovery). Is this something that the Consul team is looking into?

For example:

Health check #1 fails, restart service

Health check #2 fails, generate jstack + force restart

Health check #3 fails, leave service as is, query Logstash, query grafana, generate failure report, send email.

Will Consul support custom handlers?

Should we look into a mix of Consul and Sensu, as was proposed in this thread, to achieve this?

Armon Dadgar

Oct 1, 2014, 11:42:52 PM

to sylvain boily, consu...@googlegroups.com

I’m not exactly sure what you mean by this. If you mean to simply take actions based on health checks, take a look at “watches”:

http://www.consul.io/docs/agent/watches.html

http://www.consul.io/docs/commands/watch.html

They let you hook custom handlers into various events that happen in Consul very easily.
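A small sketch of a handler along those lines, assuming it is invoked by a watch of type `checks` and receives a JSON array of check entries (with `CheckID` and `Status` fields) on standard input. The check IDs and action names are hypothetical, and the handler only prints the action it would take.

```python
import json
import sys

# Hypothetical mapping from check ID to the action the handler would take.
DEFAULT_ACTIONS = {
    "service:web": "restart-service",
    "service:jvm-app": "jstack-then-restart",
}


def plan_actions(checks, actions=None, default="notify"):
    """Return (check_id, action) pairs for every check that is critical."""
    actions = DEFAULT_ACTIONS if actions is None else actions
    planned = []
    for check in checks:
        if check.get("Status") == "critical":
            check_id = check["CheckID"]
            planned.append((check_id, actions.get(check_id, default)))
    return planned


def main(stream=sys.stdin):
    """Read the watch payload from stdin and print the planned actions."""
    payload = stream.read()
    # The payload may be empty or JSON null when there is nothing to report.
    checks = json.loads(payload) if payload.strip() else []
    for check_id, action in plan_actions(checks or []):
        print("{} -> {}".format(check_id, action))
```

Saved as an executable script that calls `main()`, this could be wired up with something like `consul watch -type checks /path/to/handler.py`.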

Best Regards,

Armon Dadgar


Ingo Oeser

Oct 2, 2014, 3:18:03 AM

to consu...@googlegroups.com

Implementing advanced monitoring features that suppress false positives requires just 2 things per service:

How long has this service been in this state already?

For how many consecutive checks has it returned the same result? (This could be capped at a number below 100 to avoid overflows and still be useful.)

If you feel diligent, you can add an "is it currently considered flapping?" flag too. This flag should be set by the check itself.

Now rate-based and time-based flap detection is possible in the checks.

Increasingly heavy countermeasures, again based on the age and count of state changes, are also possible.

While working with a lot of monitoring tools, I noticed this behaviour needs to be controlled by the checks, because there are too many special cases and the countermeasures are local anyway.

Only notification is actually a global feature. I would recommend doing that on a few less exposed machines with light load.

Reasons: Notification often requires credentials, and it will likely not work locally anyway, or might get heavily delayed if many checks fail. With locally queued notifications, this leads to a wave of notifications after the problem is actually resolved; with unqueued notifications, notifications from unhealthy nodes are lost. The first leads to ignorance (alert fatigue); the second misses the point of notification on failure.
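The two per-service values described above can be sketched as a small tracker. This is an illustration under stated assumptions: the class and function names, the thresholds, and the countermeasure labels are all invented here, and the cap of 99 is just one value "below 100".

```python
import time

MAX_COUNT = 99  # cap on the consecutive-result counter, kept below 100


class CheckState:
    """Tracks how long, and for how many consecutive runs, a check has held its status."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self.status = None
        self.since = None
        self.consecutive = 0

    def record(self, status):
        """Register one check result."""
        if status == self.status:
            self.consecutive = min(self.consecutive + 1, MAX_COUNT)
        else:
            # State change: restart both the timer and the counter.
            self.status = status
            self.since = self._clock()
            self.consecutive = 1

    def age(self):
        """Seconds spent in the current state."""
        return 0.0 if self.since is None else self._clock() - self.since


def countermeasure(state, restart_after=3, notify_after_seconds=300):
    """Pick an increasingly heavy response based on count and age."""
    if state.status != "critical":
        return "none"
    if state.age() >= notify_after_seconds:
        return "notify"
    if state.consecutive >= restart_after:
        return "restart"
    return "wait"
```

A single failed run only yields "wait", which is how one-off false positives get suppressed; local actions such as "restart" stay on the node, while "notify" would be handed off to the separate notification machines.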

David Johnsson

Dec 3, 2014, 7:49:54 PM

to consu...@googlegroups.com

Hi Armon

Thanks for your tips here.

With regards to "The other metrics are collected via collectd and pushed to the “carbon.service.consul”. So same deal there, we decouple the discovery from the metrics shipping itself."

How do you decouple the discovery of service endpoints within the collectd config? I would really like to be able to use Consul to return our container services (+ monitoring ports), feed the info into collectd, and then have collectd collect and ship the metrics.

It seems we need to statically define the services and ports in the collectd config at startup. This is not great because it means our monitoring is not dynamic.

Thanks
Dave

Armon Dadgar

Dec 3, 2014, 9:06:36 PM

to consu...@googlegroups.com, David Johnsson

Hey Dave,

You would statically configure the endpoint to something like “carbon.service.consul”. That part is static, yes, but the resolution of that name into an actual carbon endpoint is done dynamically by Consul over DNS.
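To make the static-name, dynamic-resolution split concrete, here is a minimal sketch that ships one metric in Graphite's plaintext format (`path value timestamp`). It assumes carbon listens on its default plaintext port 2003 and that the host resolves “.consul” names through Consul; the function names and the metric path are illustrative.

```python
import socket
import time


def format_carbon_line(path, value, timestamp=None):
    """Build one line of the Graphite plaintext protocol."""
    ts = int(timestamp if timestamp is not None else time.time())
    return "{} {} {}\n".format(path, value, ts)


def ship(line, host="carbon.service.consul", port=2003):
    """Open a TCP connection and send one line.

    The name is looked up on every call, so whichever carbon instance
    Consul currently returns is the one that receives the metric.
    """
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(line.encode("ascii"))
```

The configured value (`carbon.service.consul`) never changes; only the address it resolves to does.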

Does that make sense? Maybe I misunderstood the question.

Best Regards,

Armon Dadgar


David Johnsson

Dec 3, 2014, 10:38:59 PM

to consu...@googlegroups.com

OK, great. So we use the Consul URLs in our collectd config and Consul will do the routing.

Thanks
