On Mon, Jun 4, 2012 at 10:52 PM, Leigh Heyman <
leigh....@gmail.com> wrote:
> I'll chime in here also a little on the late (and long) side.
>
> Like several others on this thread I've been through the cycle of trying to
> find something "better" only to wind up back with nagios. While we would
> occasionally find tools that had a feature or two that worked better, none
> ever really seemed better enough to justify investing in the work of
> migrating a fairly complex monitoring footprint (if yours wasn't you
> probably wouldn't be seeking an alternative).
>
> I constantly sought (what I thought was) the holy grail: a system that could
> combine both alerting and trending in one. But I came to believe that
> thinking is inherently flawed.
>
> I came instead to the maxim that your alerting system should have a very
> high barrier of entry for new checks, while your trending/graphing system
> should be as easy as possible to add new data.
>
While I agree with your general theme, I disagree with the
high-barrier on the check side. Discouraging checks means discouraging
monitoring means discouraging knowing when shit blows up. Check
accessibility and trending accessibility are orthogonal.
> Put simply, the objectives of the two tools are in opposing tension. You
> don't want to be paged for every little thing (e.g. I don't need a page
> about the site being down, if I just got paged that the db behind the site
> is down, right?), yet likewise, you don't want to be stuck unable to figure
> out what's wrong, because you weren't graphing the one thing that would help
> you.
>
The thing is, registering an intent with your check system needs to be
something automatable (ideally over an API). I despise the idea that
I have to BOUNCE Nagios to add a new check or a new host. I have to
STOP monitoring to add new monitors? In the most colorful language
possible, how fucked is that?
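
As a rough sketch of what I mean (every name here, CheckRegistry,
register, run_all, is made up for illustration and is not any existing
tool's API), adding a check should just mutate live state while
everything already registered keeps running:

```python
import threading
from typing import Callable, Dict


# Hypothetical in-memory registry; illustrative only, not a real tool's API.
class CheckRegistry:
    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], bool]] = {}
        self._lock = threading.Lock()

    def register(self, name: str, check: Callable[[], bool]) -> None:
        """Add or replace a check without stopping anything."""
        with self._lock:
            self._checks[name] = check

    def run_all(self) -> Dict[str, bool]:
        """Run every registered check once; True means healthy."""
        with self._lock:
            # Snapshot so a slow check never blocks a new registration.
            checks = dict(self._checks)
        return {name: check() for name, check in checks.items()}


registry = CheckRegistry()
registry.register("web01/http", lambda: True)
first = registry.run_all()

# A new host appears. No bounce, and web01 was never unmonitored.
registry.register("db01/tcp", lambda: False)
second = registry.run_all()
```

Put an HTTP endpoint in front of register() and configuration management
can call it when a node comes up.
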
> There's an entropy problem in the signal to noise ratio, which I realized
> was at the core of why I kept wanting to abandon nagios. We'd go through
> these cycles where the SnR for nagios would slowly get unworkable, we'd
> decide nagios sucked, spend a ton of time trying to figure out a
> better tool, then eventually just decide instead to take a hatchet to our
> nagios configs, write some more scripts, wrestle the SnR under control, then
> rinse and repeat a few months later. Finally, after one of these iterations,
> we realized it wasn't the tool, but how we were using it. One of the guys
> finally made a rule, "all alerts must be actionable." From then on any new
> check was subject to a team review where the person proposing the new check
> had to explain to the rest of the team what the action was if it triggered.
>
This is a soft skill and classification issue though. You should never
be discouraged from adding visibility into your infra. On the flip
side, not all checks are equal. If something isn't actionable, you
shouldn't be alerting on it. Get rid of any "known errors". Those are
killer. Just last week I didn't listen to my own advice and ignored an
alarm because I misread it.
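
A minimal sketch of that split, assuming a hypothetical Check record
with an optional action field (nothing here is a real tool's schema):
every check adds visibility, but only the ones with a stated action
page anyone.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Check:
    name: str
    # What a human does when this fires. None means visibility only.
    action: Optional[str] = None


def route(failing: List[Check]) -> Tuple[List[str], List[str]]:
    """Split failing checks into (page, record_only) by whether an action exists."""
    page = [c.name for c in failing if c.action]
    record_only = [c.name for c in failing if not c.action]
    return page, record_only


# Example check names and actions are invented for illustration.
failing = [
    Check("db01/replication_lag", action="fail over to db02"),
    Check("web01/cache_hit_ratio"),
]
page, record_only = route(failing)
```
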
We also need to get out of this binary model of good vs. bad. We need
intelligence in our monitoring. Yeah I might have lost a riak node
overnight for some reason but that's okay because the cluster is 5
nodes. Now if I lose another, I'm going to start getting concerned.
Are they in the same rack? Maybe they're in the same AZ and it's
having flakiness. Maybe it's just a spurious error. Does it always
happen at the same time? We can bake some of this intelligence in but
not with something like Nagios. (Well, maybe you can: an event handler
that queries some sort of CEP or something to determine the "color".
And whatever system does that needs to handle user-defined nuances.)
Somewhere John Allspaw is having trouble sleeping because I just
suggested more automation ;)
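
Something like the following is the kind of "color" logic I have in
mind. It's only a sketch: the function name, the tolerate threshold and
the rack labels are assumptions for illustration, and a real version
would pull topology from wherever you keep it.

```python
from typing import Dict, Set, Tuple


def cluster_color(nodes: Dict[str, str], down: Set[str],
                  tolerate: int = 1) -> Tuple[str, str]:
    """Return (color, note) for a cluster.

    nodes maps node name -> rack (or AZ); down is the set of failed nodes.
    tolerate is how many losses the cluster absorbs before anyone worries
    (an assumed, user-defined number).
    """
    if not down:
        return "green", "all nodes up"
    if len(down) <= tolerate:
        return "yellow", "within tolerance, look at it in the morning"
    racks = {nodes[n] for n in down}
    if len(racks) == 1:
        # Every failure in one place hints at a shared cause.
        return "red", f"{len(down)} nodes down, all in {racks.pop()}"
    return "red", f"{len(down)} nodes down across {len(racks)} racks"


# Five-node cluster, as in the riak example; rack names are made up.
nodes = {
    "riak1": "rack-a", "riak2": "rack-a", "riak3": "rack-b",
    "riak4": "rack-b", "riak5": "rack-c",
}
one_down = cluster_color(nodes, {"riak3"})
two_down = cluster_color(nodes, {"riak3", "riak4"})
```
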
> Once we accepted that mindset, we broke out of the signal to noise cycle,
> and we're able to spend our time improving our overall ecosystem instead of
> wasting time re-evaluating and re-engineering our monitoring and graphing
> every few months.
>
I've got a new world view that I want to live in. One where
applications and hosts, by virtue of configuration management, register
their intent with a monitoring system. They just come online and start
pushing metrics and checking in with the monitoring system. If they
don't check in within some reasonable interval, the monitoring
system goes into "yo dude. you okay over there?" mode on a well-known
port (e.g. a zeromq agent on all machines). Monitors could even discover
new nodes more intelligently than port scanning a subnet and crashing
print servers. New nodes register presence akin to XMPP.
Imagine an IRC channel with all your servers and your monitoring host
with channel ops. Many folks are doing something like this with
Zookeeper as well.
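
The check-in half of that fits in a few lines. This is a sketch with
assumed names (PresenceTracker, check_in, overdue) and an arbitrary
60-second interval; the transport (zeromq, XMPP, Zookeeper, whatever)
is deliberately left out.

```python
import time
from typing import Dict, List, Optional


class PresenceTracker:
    def __init__(self, interval: float = 60.0) -> None:
        self.interval = interval  # seconds a node may stay silent
        self._last_seen: Dict[str, float] = {}

    def check_in(self, node: str, now: Optional[float] = None) -> None:
        """Called when a node registers and on every heartbeat after that."""
        self._last_seen[node] = time.time() if now is None else now

    def overdue(self, now: Optional[float] = None) -> List[str]:
        """Nodes that missed their interval and should be actively probed."""
        now = time.time() if now is None else now
        return sorted(
            node for node, seen in self._last_seen.items()
            if now - seen > self.interval
        )


# Explicit timestamps keep the example deterministic.
tracker = PresenceTracker(interval=60.0)
tracker.check_in("web01", now=0.0)
tracker.check_in("db01", now=50.0)
to_probe = tracker.overdue(now=70.0)  # web01 has been silent for 70s
```

Anything overdue() returns is what gets the "you okay over there?"
treatment.
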
We need to start leveraging discoveries and ideas that have come up
since Netsaint was written. Things like service discovery (DNS TXT/SRV
records anyone?).
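
For the SRV idea, the record data is just four fields: priority,
weight, port, target. A sketch of consuming it, assuming the records
have already been fetched by whatever resolver you use (the hostnames
and port below are invented examples):

```python
from typing import List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(rdata: str) -> SrvRecord:
    """Parse SRV record data of the form 'priority weight port target'."""
    priority, weight, port, target = rdata.split()
    return SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))


def order_targets(records: List[SrvRecord]) -> List[SrvRecord]:
    # Lowest priority value first; within a priority, heaviest weight first.
    # RFC 2782 calls for weighted random selection among equal priorities;
    # a plain sort is a simplification to keep the sketch short.
    return sorted(records, key=lambda r: (r.priority, -r.weight))


records = [
    parse_srv("20 0 5666 mon3.example.com."),
    parse_srv("10 40 5666 mon2.example.com."),
    parse_srv("10 60 5666 mon1.example.com."),
]
ordered = [r.target for r in order_targets(records)]
```
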
Apologies if this sounded like a directed attack at you. It most
assuredly wasn't. Monitoring in general is something that's been
bugging me for a while.
> -L
>
>
> On Monday, June 4, 2012, Stacey Schneider wrote:
>>
>> It does depend a lot on what you are trying to monitor for the style of
>> monitoring you are looking at. Nagios is a solid "toolbelt" option that most
>> admins start with. Hyperic (open source) built an upgrade path from Nagios so
>> you can include all existing monitoring there, but then you can also do
>> easier management of managed resources, auto-discovery, and a host of
>> graphing, alerting and cross-language/platform features that typically take
>> a bit of work in Nagios to deploy. Check it out:
>>
>> Product page: http://www.hyperic.com/products/nagios-monitoring
>> Plugin details: http://support.hyperic.com/display/hyperforge/Nagios
>> Hyperic Nagios documentation: http://support.hyperic.com/display/DOCS46/Nagios
--
John E. Vincent
http://about.me/lusis