Health monitoring

Kelly Sommers

Sep 18, 2012, 9:43:25 AM

to distsys...@googlegroups.com

Hey all,

We have a mixed platform deployment and we need something to handle collecting error and tracing logs as well as giving us some kind of health status of various parts of our system.

It needs to monitor applications running on Windows and Linux and also keep tabs on our Cassandra cluster, among other things. If it doesn't cover all that, it needs an easy-to-consume API so we can hook in the things it doesn't support out of the box.

Can anyone suggest anything, and can you describe your experience with it and what you use it for? I've heard of Splunk but I haven't seen much about it yet.

Michael Rose

Sep 18, 2012, 9:49:49 AM

to distsys...@googlegroups.com

Splunk is good, though be prepared to pay for it.

I've yet to find anything that's a great comprehensive dashboard for non-homogeneous systems. Loggly + NewRelic is pretty great for smaller, more typical stacks.

At this point, I've been rolling my own dashboard taking data from Graphite, Nagios, syslog. I used to use OpenTSDB but ended up phasing that out in favor of Graphite.

If you find anything that's a nice framework for creating dashboards (Geckoboard doesn't count), please post it.

OpsCenter is alright for keeping an eye on Cassandra.

Perhaps it's telling, but we have a six-monitor workstation whose sole job is to graph various stats from Graphite.
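
For anyone rolling a similar dashboard: Graphite's render endpoint can return JSON instead of an image when asked with `format=json`, which makes it easy to pull the latest value of each series. What follows is only a minimal sketch of that idea, not the dashboard described above; the base URL, metric names, and helper function names are hypothetical.

```python
import json
from urllib.parse import urlencode


def render_url(base_url, targets, window="-1h"):
    """Build a Graphite render URL that asks for JSON instead of a PNG."""
    params = [("target", t) for t in targets]
    params += [("from", window), ("format", "json")]
    return base_url.rstrip("/") + "/render?" + urlencode(params)


def latest_values(render_json):
    """Map each series name to its most recent non-null value (or None)."""
    result = {}
    for series in json.loads(render_json):
        # Each datapoint is a [value, timestamp] pair; value is null for empty slots.
        values = [v for v, _ts in series["datapoints"] if v is not None]
        result[series["target"]] = values[-1] if values else None
    return result
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the response body to `latest_values` gives a dictionary that a dashboard page could render however it likes.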

--

Michael Rose (@Xorlev)

Senior Backend Engineer, FullContact
mic...@fullcontact.com

--
You received this message because you are subscribed to the Google Groups "Distributed Systems" group.
To post to this group, send email to distsys...@googlegroups.com.
To unsubscribe from this group, send email to distsys-discu...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Kelly Sommers

Sep 18, 2012, 9:58:28 AM

to distsys...@googlegroups.com

Thanks for the swift reply Michael!

I'm having difficulty finding out what Splunk supports.

Do you know if it has an API and what it's like to integrate with? I'm guessing that since our cluster is diverse we might have to hand-roll some code to put data into Splunk.

Michael Rose

Sep 18, 2012, 10:01:21 AM

to distsys...@googlegroups.com

http://dev.splunk.com/view/sdks-apis/SP-CAAADP7

They have a REST API + wrappers around that API. I've never integrated with them, however.

--

Michael Rose (@Xorlev)

Senior Backend Engineer, FullContact
mic...@fullcontact.com

Mikhail Panchenko

Sep 18, 2012, 11:10:40 AM

to distsys...@googlegroups.com

You might check out Librato Metrics (https://metrics.librato.com/metrics): an HTTP API for collecting metrics plus lots of existing OSS libraries (http://support.metrics.librato.com/knowledgebase/articles/24205-custom-and-contributed-data-collectors), such as a collectd plugin, though I'm not sure how much there is in the way of Windows support.

They don't do log parsing or aggregation for you, but they do have threshold alerts, and their dashboards are glorious and easily embeddable. I believe they recently added "annotation streams" (http://dev.librato.com/v1/annotations), which are handy for marking things like Cassandra repairs. Their API is pretty straightforward, so you should be able to hook whatever you need into it.

DISCLAIMER: I worked on the Java libs for Librato, and they are good friends.

Kevin Smith

Sep 18, 2012, 11:16:51 AM

to distsys...@googlegroups.com

We rely on a combination of Splunk, Graphite, estatsd, and some homegrown tooling to monitor our production systems. If your application logs to disk, then getting that data into Splunk is pretty simple. Splunk comes with a bunch of log format parsers to simplify log data ingestion. Some of our components use their own log file format, so we had to teach Splunk how to parse their logs. It turned out to be fairly hassle-free. This does mean you have to run Splunk's Java agent on the target servers, which is something to consider.

If you can afford Splunk, it's probably worth it. If I were starting from scratch I'd also take a look at logstash. I don't have any experience with it, but the project seems to be getting some traction and merits a glance at least.

--Kevin

Nate McCall

Sep 18, 2012, 12:21:45 PM

to distsys...@googlegroups.com

Agree on the Splunk and Graphite combination (particularly the price part of Splunk: I hope you have a good budget, but it does just kind of work).

Be ready to add federation immediately out of the box with Graphite, even with a moderate number of machines. It's a great tool, but it requires a lot of IO availability, as its storage plumbing is not the best with out-of-order data.
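
For context on the out-of-order point: Graphite's carbon daemon accepts a plaintext protocol, one `path value timestamp` line per datapoint, by default on TCP port 2003. Below is a minimal sketch of a sender that sorts a batch oldest-first before writing, so each series arrives in time order. The metric paths, host, and function names are hypothetical, and sorting on the client is just one possible mitigation, not something this thread prescribes.

```python
import socket


def to_plaintext(points):
    """Render (path, value, unix_timestamp) tuples as Graphite plaintext lines.

    Points are sorted by timestamp so the batch is written oldest-first.
    """
    ordered = sorted(points, key=lambda p: p[2])
    return "".join(f"{path} {value} {int(ts)}\n" for path, value, ts in ordered)


def send_to_carbon(points, host="localhost", port=2003, timeout=5.0):
    """Send a batch of points to a carbon plaintext listener over TCP."""
    payload = to_plaintext(points).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```

Because the wire format is just lines of text, the same formatter works from any platform that can open a TCP socket, which suits a mixed Windows and Linux deployment.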

Loggly is a good starting point for smaller apps and very reasonably priced, but I was frequently disappointed by how cumbersome their search interface was.

In general, OpsCenter is excellent for purpose-built monitoring and admin of a Cassandra cluster (full disclosure: I used to work for DataStax and am friends with all those folks). That said, I'd love to spend some time porting the column family schema from OpsCenter to be a whisper replacement. It is a very well-optimized time series collection; take it apart if you want to know how to do this correctly in Cassandra.

My only beef with OpsCenter is that, though quite efficient, the 'store thy own metrics data' approach means there is still some load to deal with, which may exacerbate a bad situation if you anticipate traffic spikes (or you are nervous about your capacity planning models for Cassandra :)

Michael Rose

Sep 19, 2012, 1:54:47 AM

to distsys...@googlegroups.com

Hi Nick,

It wasn't anything to do with OpenTSDB itself; it worked rather well. We moved away from HBase as a technology, so OpenTSDB went with it.

OpenTSDB was easy to set up and use, especially if you already have an HBase cluster. For us personally, we didn't have enough metrics to push Graphite/whisper out of the running.

Michael

--

Michael Rose (@Xorlev)

Senior Backend Engineer, FullContact
mic...@fullcontact.com

On Tuesday, September 18, 2012 at 9:33 AM, Nick Telford wrote:

On Tuesday, 18 September 2012 14:49:52 UTC+1, Michael Rose wrote:
At this point, I've been rolling my own dashboard taking data from Graphite, Nagios, syslog. I used to use OpenTSDB but ended up phasing that out in favor of Graphite.

I'm interested in why you decided to move away from OpenTSDB to Graphite. We collect a ton of metrics and Graphite is starting to show its storage limitations. I've been considering OpenTSDB to replace the backend. I'd love to hear more on your experiences here.

Regards,
--
Nick Telford

--
To view this discussion on the web visit https://groups.google.com/d/msg/distsys-discuss/-/wdMlo1Pm8SYJ.

Bryan Hunter

Mar 10, 2013, 3:58:58 PM

to distsys...@googlegroups.com

You might also take a look at Boundary. Here's a nice summary page (http://boundary.com/ways/). Here are customer case studies: http://boundary.com/company/customers/#github

I've kept an eye on them because they use Erlang, and well... because they're badass: http://talkincloud.com/public-cloud/amazon-cloud-crash-monitoring-firm-saw-clues-early

-Bryan
