Monitoring solution


Jens Braeuer

May 31, 2012, 2:58:36 AM
to devops-t...@googlegroups.com
Hi everyone,

I'd like to get your recommendations, pointers and experience regarding
monitoring solutions. I have run Nagios/Icinga for some time, but somehow it
feels like Nagios's time is over. While I like Puppet's Nagios features, I
want to avoid using exported resources. The infrastructure to be
monitored is Linux only, automated with Puppet/MCollective and to some
degree dynamic (autoscaling). Performance data is logged via Graphite.

Off the top of my head, these alternatives come to mind:
- Sensu:
http://joemiller.me/2012/01/19/getting-started-with-the-sensu-monitoring-framework/
- Zabbix: http://www.zabbix.com/

What state-of-the-art monitoring solution would you recommend? It does
not have to be an all-in-one solution; I am fine with putting together
pieces (to some degree). Or should I stick with the proven tool (Icinga)?

Thanks for your opinions,
Jens

Guillaume FORTAINE

May 31, 2012, 5:50:33 AM
to devops-t...@googlegroups.com
> What state-of-the-art monitoring solution would you recommend?

http://www.shinken-monitoring.org/

Will Thames

May 31, 2012, 5:57:31 AM
to devops-t...@googlegroups.com
Shinken looks pretty impressive from the website.

Have you actually used it? Does it scale as well as the claims? Is it definitely better than other things built on Nagios (e.g. Opsview)?

I'd really like to know in what environment you've seen it perform, and what you've compared it against.

Thanks,
Will

Karanbir Singh

May 31, 2012, 6:04:32 AM
to devops-t...@googlegroups.com
Hi,

On 05/31/2012 07:58 AM, Jens Braeuer wrote:
> Out of my head, these alternatives come to my mind:
> - Sensu:
> http://joemiller.me/2012/01/19/getting-started-with-the-sensu-monitoring-framework/
> - Zabbix: http://www.zabbix.com/

Both of those are good solutions for their own target audiences. I've
used zabbix, a lot, and it works fairly well. The API is decent and you
can almost get away without having to use the web-ui to add and manage
resources.

I've also been looking at ExtreMon for high-frequency monitoring - in
the region of a few hundred resources monitored at between 90 and 300
samples per second. I have not got it up and working completely, so I don't
have an opinion on it yet, but it's looking good. I especially like the fact
that the SVG used to render the display can be generated from code and
adapted to handle changing situations, again from within code.

--
Karanbir Singh
+44-207-0999389 | http://www.karan.org/ | twitter.com/kbsingh
ICQ: 2522219 | Yahoo IM: z00dax | Gtalk: z00dax
GnuPG Key : http://www.karan.org/publickey.asc

David Workman

May 31, 2012, 6:18:40 AM
to devops-t...@googlegroups.com
If you're open to hosted solutions, New Relic is pretty good and has free server monitoring (application-level metrics cost extra, and can be quite expensive).

Bryan Berry

May 31, 2012, 7:28:29 AM
to devops-t...@googlegroups.com
I have not used sensu myself but I met a number of people at Chefconf that were using it in production environments and happy with it. 

Vladimir Vuksan

May 31, 2012, 7:33:51 AM
to devops-t...@googlegroups.com
Trouble here is that this is too vague to answer :-) i.e. besides the
perception of being outdated, what issues do you see with Nagios/Icinga?

Zabbix, for example, is nice; however, it requires a fairly hefty machine as
it uses a SQL database for its storage.

Vladimir

Eric Anderson

May 31, 2012, 7:43:44 AM
to devops-t...@googlegroups.com
Another hosted option is CopperEgg.  Works great on Linux, has process monitoring, alerting, etc.  

Eric

Thomas Vincent

May 31, 2012, 10:48:36 AM
to devops-t...@googlegroups.com
It really depends on your requirements. Do you have a lot of devices? Do you have a NOC? Do you have a lot of different devices where a strong community that has produced a lot of plugins would be helpful? Do you need Nagios compatibility for any custom plugins you have written?

For smaller shops, Nagios and Zabbix work. But IMHO, they don't scale. I use Zenoss because of its flexibility, and it can scale (and has) to tens of thousands of devices. It has a rich API (great Chef integration, good Puppet integration) and supports Nagios and Cacti plugins. I can produce plugins very quickly for the new apps that are constantly coming down the pipe for me to monitor. Zenoss 4.1.1 uses RabbitMQ, Memcache, and MySQL for its infrastructure. That may be too much for your needs.

Personally, being able to use a Nagios plugin within any new monitoring system would be a requirement for me. I would not want to give up the rich community of plugins that Nagios has. 

Tom
--
Cheers,
Tom

Nicholas Tang

May 31, 2012, 11:04:47 AM
to devops-t...@googlegroups.com
Another 'nother option is Reconnoiter: http://labs.omniti.com/labs/reconnoiter

It's not widely used, but is pretty nifty from what I've seen.  (I haven't run it in production.)  If I was starting over again, I'd definitely consider it as an option - Theo and the crew at Omniti generally know how to build good apps and understand the pain involved in running operations well.

Circonus is a commercial monitoring platform from Omniti: http://circonus.com/ (I believe it's basically just a commercial version of Reconnoiter, but I don't actually know this, it's just a guess.)

Nicholas

Vladimir Vuksan

May 31, 2012, 11:18:50 AM
to devops-t...@googlegroups.com
Reconnoiter doesn't do alerting IIRC.

Vladimir

Scott Smith

May 31, 2012, 3:15:05 PM
to devops-t...@googlegroups.com
I make very heavy use of Sensu here at DISQUS. I replaced our old
Nagios instance with it. It is quite capable for active and passive
system and service level monitoring.

In conjunction with my graphite check script
(http://github.com/disqus/nagios-plugins) and porkchop
(http://github.com/disqus/porkchop) sensu makes for the perfect
monitoring system.

Scott Smith

May 31, 2012, 3:18:48 PM
to devops-t...@googlegroups.com
At my previous job, Nagios server load avg once hit 15k due to nsca
forkbomb. #winning

JP Schneider

May 31, 2012, 3:21:05 PM
to devops-t...@googlegroups.com
Have to give a hat tip to using Opsview (or Nagios) along with check-mc-nrpe (assuming you are using MCollective).

Vladimir Vuksan

May 31, 2012, 3:32:27 PM
to devops-t...@googlegroups.com
I haven't used NSCA in years :-). I am a big proponent of using your
trending data to do the bulk of your alerting. That scales quite nicely. Also,
doing consolidated checks works very well.

Vladimir

Scott Smith

May 31, 2012, 3:35:19 PM
to devops-t...@googlegroups.com
Ya, definitely. The majority of my service checks now are using graphite data
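
A service check driven by Graphite data could look roughly like the sketch below. It assumes the JSON shape returned by Graphite's render API with format=json (a list of series, each holding [value, timestamp] pairs) and the standard Nagios plugin exit codes. The metric name, thresholds and function name are made up for illustration; this is not the script from the disqus/nagios-plugins repository.

```python
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_series(render_json, warn, crit):
    """Return (exit_code, message) for the mean of the first series.

    render_json is the decoded body of a Graphite render call with
    format=json: [{"target": ..., "datapoints": [[value, ts], ...]}].
    """
    if not render_json:
        return UNKNOWN, "UNKNOWN: no series returned"
    series = render_json[0]
    # Graphite reports gaps as null (None after decoding); skip them.
    values = [v for v, _ts in series["datapoints"] if v is not None]
    if not values:
        return UNKNOWN, "UNKNOWN: no datapoints for %s" % series["target"]
    mean = sum(values) / len(values)
    if mean >= crit:
        return CRITICAL, "CRITICAL: %s mean=%.2f" % (series["target"], mean)
    if mean >= warn:
        return WARNING, "WARNING: %s mean=%.2f" % (series["target"], mean)
    return OK, "OK: %s mean=%.2f" % (series["target"], mean)


# Hypothetical sample payload; a real check would fetch this over HTTP
# and pass the exit code to sys.exit().
sample = json.loads(
    '[{"target": "web01.cpu.user", '
    '"datapoints": [[70.0, 1338492900], [null, 1338492960], [90.0, 1338493020]]}]'
)
code, message = check_series(sample, warn=60, crit=85)
print(message)
```

Keeping the evaluation separate from the HTTP fetch makes the threshold logic easy to test without a running Graphite instance.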

Mark Smith

May 31, 2012, 3:44:55 PM
to devops-t...@googlegroups.com
Well, hello there mailing list ...

I go around this merry-go-round every once in a while. To this day, I only like bits and pieces from each system and don't like any of them well enough to deploy them. Most of the web-sites for monitoring systems scream "commercial! enterprise!" at me and at the end of the day I don't want marketing literature, I want to monitor my machines quickly, easily, efficiently, and fairly completely.

That said, I still mostly use Nagios for the day-to-day monitoring, because it works OK. I don't have it do my alerting though, PagerDuty (pagerduty.com) is a great system for managing oncall rotations, SMS/phone/email alerts, escalation, etc. They have fairly solid Nagios integration, too. (The downside is that if you ack the PagerDuty alert, it doesn't ack the Nagios alert.)

I also have written a REST API for Nagios (https://github.com/xb95/nagios-api) which removes a lot of the pain of having shell scripts do things like schedule downtime. A small curl request and/or using the included CLI gets the job done. I'm also using this API to build it in to our dashboards so I can have an easy web-accessible Nagios system -- JavaScript + REST is easy.

For trending/analysis, I use OpenTSDB (http://opentsdb.net/) -- it's a fairly new project, but pretty excellent. Those of you who have ever worked at Google will probably recognize the output and functionality. It's a timeseries database with metrics and tags that lets you slice and dice without ever losing granularity. There are also Nagios plugins that pull from the dataset so you can do more complicated monitoring like:

* Alert if traffic is down more than 15% hour-over-hour (or up, or week-over-week, day-over-day, or any other interval)
* Alert if mean CPU usage is >80% over the last 10 minutes

Lots of other possibilities, too. It's also great for providing graphs for your dashboards. It's got some problems and some things that the UI can't do yet that I want, but it's far more useful than RRD based solutions I've tried. (Caveat lector: I have not played with Graphite, which seems to be popular.)
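
The two alert rules listed above boil down to very little arithmetic once the data has been pulled from the time-series store. A minimal sketch, assuming the caller has already fetched the numbers; the function names and default thresholds are illustrative only and do not come from OpenTSDB or its Nagios plugins:

```python
def percent_change(previous, current):
    """Relative change from previous to current, in percent."""
    if previous == 0:
        raise ValueError("previous value must be non-zero")
    return (current - previous) / previous * 100.0


def traffic_dropped(previous_window, current_window, max_drop_pct=15.0):
    """True if traffic fell by more than max_drop_pct versus the prior window.

    The windows can be hours, days or weeks; only the totals matter here.
    """
    return percent_change(previous_window, current_window) < -max_drop_pct


def mean_over_threshold(samples, threshold=80.0):
    """True if the mean of the samples (e.g. 10 minutes of CPU %) exceeds threshold."""
    return sum(samples) / len(samples) > threshold
```

For example, `traffic_dropped(1000, 800)` is true (a 20% drop), while `traffic_dropped(1000, 900)` is not.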

/2c


-- 
Mark Smith

ranjib dey

May 31, 2012, 4:23:17 PM
to devops-t...@googlegroups.com
+1 to what Mark mentioned. I have used multiple hosted and in-house monitoring systems for our clients as well as our own infrastructure. To date, I find Nagios to be the most extensible one: things like escalation, adaptive monitoring and event handlers, with everything represented as plain-text configuration, make it easy to integrate with other systems. I do seriously miss a charting engine, which I compensate for with Splunk (Splunk for Nagios), Graphite, pnp4nagios etc. But I guess there's no silver bullet; depending on your needs, you might find something more suitable. Sensu is really good for large infrastructure, but you can address that in Nagios differently, though it won't be as straightforward.

I have had unsatisfactory experiences with Zenoss and Zabbix though. It felt like they are targeted at a different audience altogether; it feels more like managing a whole webapp than an extensible monitoring/alerting system. But again, other people might find them suitable for their needs.

On the hosted side, New Relic and AppDynamics address fundamentally different problems. They are more like process instrumentation: you can grab method-level resource usage, and they are technology aware (RoR, .NET, Java etc.). Though some of them have started doing server monitoring, it's not really their stronghold, and if you are coming from a very Nagios-like background, you might be sad.

Pure server monitoring hosted solutions, like Server Density and Cloudkick, are kinda OK, in that they give you some alerting, PagerDuty integration and basic charting out of the box, but anything unconventional is difficult to do, and sometimes their own sites are down (seriously, I have experienced this :-) ).

Things like Datadog and Nodable are a different variety: not purely monitoring systems, but more analytics on top of data aggregation.

The new kid, Boundary, is very promising. They provide something that is cool and useful (a rare combination to find :-) ): you can traverse your network traffic flows, segregate them based on location, type etc., and see how traffic conversion happens across a multi-tier setup.

So, unless you are sure and have undertaken a pilot study, stick to Nagios, because you can do most things with it. Maybe in an ugly way, but you can do it; maybe not with a REST-like API, but with livestatus + xinetd.


regards
ranjib

Karanbir Singh

May 31, 2012, 5:14:36 PM
to devops-t...@googlegroups.com
On 05/31/2012 08:32 PM, Vladimir Vuksan wrote:
> I haven't used NSCA in years :-). I am big proponent of using your
> trending data to do bulk of your alerting. That scales quite nicely.
> Also doing consolidated checks works very well.
>

a big +1 for the trending point - I prefer to know when things are
degrading or not-in-their-normal-state rather than a preconceived state
at time of monitoring setup.

When a state goes RED, it's already too late. Ideally you want to know
when it's going from green to amber, and be able to get some estimate
of the time-to-amber and then the time-to-red.
Admittedly, you need a reasonable corpus of history before these signals
start getting close to reality ( and in some cases they never will ).
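
One simple way to get a time-to-amber or time-to-red estimate is to fit a straight line to recent samples and extrapolate. The sketch below assumes a metric that rises toward an upper threshold and behaves roughly linearly over the sampled window; both are assumptions, and the function name is illustrative.

```python
def time_to_threshold(samples, threshold):
    """Estimate seconds until a rising metric reaches threshold.

    samples is a list of (timestamp_seconds, value) pairs. A least-squares
    line is fitted and extrapolated from the last sample. Returns 0.0 if the
    last sample is already at or above the threshold, and None if the metric
    is flat, falling, or there is too little data to fit a line.
    """
    if len(samples) < 2:
        return None
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    last_t, last_v = samples[-1]
    if last_v >= threshold:
        return 0.0
    if slope <= 0:
        return None
    intercept = mean_v - slope * mean_t
    crossing = (threshold - intercept) / slope
    return max(crossing - last_t, 0.0)


# Disk usage in percent, sampled once a minute (made-up numbers).
usage = [(0, 50.0), (60, 52.0), (120, 54.0)]
time_to_amber = time_to_threshold(usage, 80.0)
time_to_red = time_to_threshold(usage, 90.0)
```

With only a handful of samples the estimate is noisy, which matches the caveat above about needing a reasonable corpus of history.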

Secondly, I always found getting nagios to make reactive decisions was
exceptionally fiddly. Lots of code needs writing, management and
monitoring on top of that etc. You may as well just run cfengine every
30 seconds on the host to make decisions on resource / state changes at
that point.

Then there is the scheduler issue - at moderate resolutions, eg. every
few seconds, across a few hundred resources, you never really know what
or when the tests are being run. I heard that the Icinga guys were going
to address this issue, but haven't kept up to see what the state of play
there is. I really hope they fixed it.

- KB

Jens Braeuer

Jun 1, 2012, 3:18:25 AM
to devops-t...@googlegroups.com
Wow! Thanks for all the answers on the list.

Things that I like about Nagios are
- the large number of existing plugins and the ease of writing new ones
- alerting, escalation features
- it is mature, so I never experienced crashes or had to debug it, etc.

Things I would like to see in my "new" monitoring solution
- a command line/REST interface, so I can script downtimes
- integration with Graphite/OpenTSDB for the usecase of interactive
graphing (I want resource consumption of different servers in 1 graph)
- alerts based on trending
- ability to add/remove hosts easily (AWS autoscaling brings up a new
instance - instance should add itself to the monitoring)
- (maybe a nice UI)
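
For the scripted-downtime wish, plain Nagios already accepts external commands written to its command file. The sketch below builds a SCHEDULE_HOST_DOWNTIME line in the documented external command format. The helper names are illustrative, and the command file path is site-specific, so it is passed in rather than assumed.

```python
import time


def schedule_host_downtime(host, duration_s, author, comment, now=None):
    """Build a SCHEDULE_HOST_DOWNTIME external command line (fixed downtime)."""
    now = int(time.time()) if now is None else int(now)
    start, end = now, now + duration_s
    fields = [
        "SCHEDULE_HOST_DOWNTIME", host, str(start), str(end),
        "1",               # fixed: downtime runs exactly from start to end
        "0",               # trigger_id: not triggered by another downtime
        str(duration_s),   # duration in seconds (used for flexible downtime)
        author, comment,
    ]
    return "[%d] %s\n" % (now, ";".join(fields))


def submit(command_line, command_file):
    """Write the command to the Nagios command pipe (path is site-specific)."""
    with open(command_file, "w") as pipe:
        pipe.write(command_line)
```

A deploy script could then call `submit(schedule_host_downtime("web01", 3600, "jens", "deploy"), path_to_command_file)` before touching the host.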

So far, I use a two level approach. Monit on the host itself to monitor
CPU, filesystem, etc. Then Nagios to make sure Monit works and
everything is green. This works around the scaling issues, as it reduces
the number of checks. In addition I use Munin to collect resource usage
and feed this data into graphite. (No trend based alerts yet).
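
Feeding collected values into Graphite needs very little code, because carbon accepts a plaintext protocol of one "path value timestamp" line per metric, by default on TCP port 2003. A minimal sketch; the metric path and host name are placeholders:

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one metric as a Graphite plaintext line: path, value, timestamp."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    return "%s %s %d\n" % (path, value, timestamp)


def send_metrics(lines, host, port=2003):
    """Send pre-formatted lines to a carbon listener over TCP."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)


# Placeholder metric path; nothing is sent until send_metrics() is called.
example = graphite_line("servers.web01.load.shortterm", 0.42, 1338500000)
```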

Jens

ranjib dey

Jun 1, 2012, 3:26:11 AM
to devops-t...@googlegroups.com
Graphios can push all Nagios perf data into Graphite, but given that you are using Monit for the checks, it might not capture all the metrics for you.
I have used Nagios event handlers + `knife <ec2/openstack> server create/delete` for autoscaling inside a private cloud or AWS. I think you can use the same for the HP/Rackspace clouds too.

Nicholas Tang

Jun 1, 2012, 10:51:37 AM
to devops-t...@googlegroups.com
We get most of this from Zenoss... some built-in, some that we've hacked on top of it.  (For instance: we have an init script that adds a machine to Zenoss via its API on boot, and deregisters itself on shutdown.  We pop that on all of our AMI's (or rather, are in the process of popping it on all of our AMI's ;) ) that handles our autoscaling monitoring needs.)

I'm not a huge fan of Zenoss, personally, but it has accomplished what we need so far, and my team likes it, so I occasionally grumble in the background but work with it.  :)

Nicholas

sean escriva

Jun 1, 2012, 1:26:46 PM
to devops-t...@googlegroups.com
A little late chiming in, but I've had great results with a combination of sensu, collectd and graphite. I gave a brief introduction recently at chefconf on how we tie these together based on what works for us. [1]

Our experience has been very good with each of these components individually. Sensu in particular is a solid, scalable, Nagios replacement.

Stacey Schneider

Jun 4, 2012, 1:40:56 PM
to devops-t...@googlegroups.com
The style of monitoring to look at depends a lot on what you are trying to monitor. Nagios is a solid "toolbelt" option that most admins start with. Hyperic (open source) built an upgrade path from Nagios so you can include all existing monitoring there, but then you can also do easier management of managed resources, auto-discovery, and a host of graphing, alerting and cross-language/platform features that typically take a bit of work in Nagios to deploy. Check it out:

Leigh Heyman

Jun 4, 2012, 10:52:32 PM
to devops-t...@googlegroups.com
I'll chime in here also a little on the late (and long) side.

Like several others on this thread I've been through the cycle of trying to find something "better" only to wind up back with nagios.  While we would occasionally find tools that had a feature or two that worked better, none ever really seemed better enough to justify investing in the work of migrating a fairly complex monitoring footprint (if yours wasn't you probably wouldn't be seeking an alternative). 

I constantly sought (what I thought was) the holy grail: a system that could combine both alerting and trending in one.   But I came to believe that thinking is inherently flawed. 

I came instead to the maxim that your alerting system should have a very high barrier of entry for new checks, while your trending/graphing system should be as easy as possible to add new data.

Put simply, the objectives of the two tools are in opposing tension. You don't want to be paged for every little thing (e.g. I don't need a page about the site being down if I just got paged that the db behind the site is down, right?), yet likewise, you don't want to be stuck unable to figure out what's wrong because you weren't graphing the one thing that would help you.
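
The "don't page me about the site when the db is already down" rule can be expressed as dependency-based suppression. A minimal sketch, assuming the dependency map is maintained somewhere (by hand or by configuration management); the check names and the function are hypothetical:

```python
def pages_to_send(failing, depends_on):
    """Return the failing checks that should page someone.

    failing is a set of failing check names; depends_on maps a check to the
    checks it depends on. A failing check is suppressed when any of its
    direct or indirect dependencies is also failing.
    """
    def has_failing_dependency(check, seen):
        for dep in depends_on.get(check, ()):
            if dep in seen:
                continue  # guard against dependency cycles
            seen.add(dep)
            if dep in failing or has_failing_dependency(dep, seen):
                return True
        return False

    return sorted(c for c in failing if not has_failing_dependency(c, {c}))


deps = {"site": ["app"], "app": ["db"]}
print(pages_to_send({"site", "db"}, deps))
```

Here only the db failure pages, because the site check depends on it indirectly. Nagios expresses the same idea with host and service dependency definitions.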

There's an entropy problem in the signal-to-noise ratio, which I realized was at the core of why I kept wanting to abandon nagios. We'd go through these cycles where the SnR for nagios would slowly get unworkable, we'd decide nagios sucked, spend a ton of time trying to figure out a better tool, then eventually just decide instead to take a hatchet to our nagios configs, write some more scripts to wrestle the SnR under control, then rinse and repeat a few months later. Finally, after one of these iterations, we realized it wasn't the tool, but how we were using it. One of the guys finally made a rule, "all alerts must be actionable." From then on any new check was subject to a team review where the person proposing the new check had to explain to the rest of the team what the action was if it triggered.

Once we accepted that mindset, we broke out of the signal to noise cycle, and we're able to spend our time improving our overall ecosystem instead of wasting time re-evaluating and re-engineering our monitoring and graphing every few months. 

-L

John Vincent

Jun 5, 2012, 12:19:04 AM
to devops-t...@googlegroups.com
On Mon, Jun 4, 2012 at 10:52 PM, Leigh Heyman <leigh....@gmail.com> wrote:
> I'll chime in here also a little on the late (and long) side.
>
> Like several others on this thread I've been through the cycle of trying to
> find something "better" only to wind up back with nagios.  While we would
> occasionally find tools that had a feature or two that worked better, none
> ever really seemed better enough to justify investing in the work of
> migrating a fairly complex monitoring footprint (if yours wasn't you
> probably wouldn't be seeking an alternative).
>
> I constantly sought (what I thought was) the holy grail: a system that could
> combine both alerting and trending in one.   But I came to believe that
> thinking is inherently flawed.
>
> I came instead to the maxim that your alerting system should have a very
> high barrier of entry for new checks, while your trending/graphing system
> should be as easy as possible to add new data.
>

While I agree with your general theme, I disagree with the
high-barrier on the check side. Discouraging checks means discouraging
monitoring means discouraging knowing when shit blows up. Check
accessibility and trending accessibility are orthogonal.

> Put simply, the objectives of the two tools, are in  opposing tension. You
> don't want to be paged for every little thing (e.g. I don't need a page
> about the site being down, if I just got paged that the db behind the site
> is down, right?), yet likewise, you don't want to be stuck unable to figure
> out what's wrong, because you weren't graphing the one thing that would help
> you.
>

The thing is registering an intent with your check system needs to be
something automate-able (ideally over an API). I despise the idea that
I have to BOUNCE Nagios to add a new check or a new host. I have to
STOP monitoring to add new monitors? In the most colorful language
possible, how fucked is that?

> There's an entropy problem in the signal to noise ratio, which I realized
> was at the core of why I kept wanting to abandon nagios. We'd go through
> these cycles where the SnR for nagois would slowly get unworkable, we'd
> decide nagios sucked, spend a ton of time trying trying to figure out a
> better tool, then eventually just decide instead to take a hatchet to our
> nagios configs, write some more scripts wrestle the SnR under control, then
> rinse and repeat a few months later. Finally, after one of these iterations,
> we realized it wasn't the tool, but how we were using it.  One of the guys
> finally made a rule, "all alerts must be actionable." From then on any new
> check was subject to a team review where the person proposing the new check
> had to explain to the rest of the team what the action was if it triggered.
>

This is a soft skill and classification issue though. You should never
be discouraged from adding visibility into your infra. On the flip
side, not all checks are equal. If something isn't actionable, you
shouldn't be alerting on it. Get rid of any "known errors". Those are
killer. Just last week I didn't listen to my own advice and ignored an
alarm because I misread it.

We also need to get out of this binary model of good vs. bad. We need
intelligence in our monitoring. Yeah I might have lost a riak node
overnight for some reason but that's okay because the cluster is 5
nodes. Now if I lose another, I'm going to start getting concerned.
Are they in the same rack? Maybe they're in the same AZ and it's
having flakiness. Maybe it's just a spurious error. Does it always
happen at the same time? We can bake some of this intelligence in but
not with something like Nagios. (Well, maybe you can: an event handler
that queries some sort of CEP or something to determine the "color".)
And whatever system does that needs to handle user-defined nuances.
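
The riak example can be turned into a small non-binary classifier. This is only a sketch of the idea under stated assumptions: the node-to-zone map, the tolerance of one lost node and the three colors are illustrative choices, not a feature of Nagios or any tool mentioned in this thread.

```python
from collections import Counter


def cluster_color(nodes, down, tolerated=1):
    """Classify a cluster as 'green', 'amber' or 'red' with a reason.

    nodes maps node name -> availability zone (or rack); down is the set of
    unreachable nodes; tolerated is how many nodes may be lost without
    anyone needing to be woken up.
    """
    lost = [n for n in down if n in nodes]
    if not lost:
        return "green", "all nodes up"
    if len(lost) <= tolerated:
        return "amber", "%d node(s) down, within tolerance" % len(lost)
    zones = Counter(nodes[n] for n in lost)
    zone, count = zones.most_common(1)[0]
    if count == len(lost):
        # Every lost node shares a zone: likely a zone or rack problem.
        return "red", "%d nodes down, all in %s" % (len(lost), zone)
    return "red", "%d nodes down across %d zones" % (len(lost), len(zones))


riak = {"riak1": "az-a", "riak2": "az-a", "riak3": "az-b",
        "riak4": "az-b", "riak5": "az-c"}
```

The reason string carries the "same rack or same AZ?" question along with the color, so whoever reads the alert starts with a hint.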

Somewhere John Allspaw is having trouble sleeping because I just
suggested more automation ;)

> Once we accepted that mindset, we broke out of the signal to noise cycle,
> and we're able to spend our time improving our overall ecosystem instead of
> wasting time re-evaluating and re-engineering our monitoring and graphing
> every few months.
>

I've got a new world view that I want to live in. One where
applications and hosts, by virtue of configuration management register
their intent with a monitoring system. They just come online and start
pushing metrics and checking in with the monitoring system. If they
don't check in in some sort of reasonable interval, the monitoring
system goes into "yo dude. you okay over there?" mode on a well known
port (i.e. zeromq agent on all machines). Monitors could even discover
new nodes more intelligently than port scanning a subnet and crashing
print servers. New nodes register presence akin to XMPP.
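
The check-in model described above needs little more than a table of last-seen timestamps on the monitoring side. A minimal in-memory sketch; the class name, the interval and the grace factor are assumptions, and the transport (zeromq, XMPP, HTTP) is deliberately left out:

```python
class PresenceRegistry:
    """Tracks nodes that announce themselves and check in periodically."""

    def __init__(self, interval_s=60, grace=2):
        self.interval_s = interval_s
        self.grace = grace          # missed intervals tolerated before probing
        self.last_seen = {}         # node name -> timestamp of last check-in

    def check_in(self, node, now):
        """Register a new node or refresh an existing one; True if new."""
        is_new = node not in self.last_seen
        self.last_seen[node] = now
        return is_new

    def deregister(self, node):
        """Remove a node that announced a clean shutdown."""
        self.last_seen.pop(node, None)

    def overdue(self, now):
        """Nodes silent for too long, which should now be probed actively."""
        limit = self.interval_s * self.grace
        return sorted(n for n, seen in self.last_seen.items() if now - seen > limit)
```

Nodes returned by `overdue()` are the ones that would get the "you okay over there?" probe; a node that deregisters cleanly, such as an autoscaled instance shutting down, never shows up there.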

Imagine an IRC channel with all your servers and your monitoring host
with channel ops. Many folks are doing something like this with
Zookeeper as well.

We need to start leveraging discoveries and ideas that have come up
since Netsaint was written. Things like service discovery (DNS TXT/SRV
records anyone?).

Apologies if this sounded like a directed attack at you. It most
assuredly wasn't. Monitoring in general is something that's been
bugging me for a while.
> -L
>
>
> On Monday, June 4, 2012, Stacey Schneider wrote:
>>
>> It does depend a lot on what you are trying to monitor for the style of
>> monitoring you are looking at. Nagios is a solid "toolbelt" option that most
>> admins start with. Hyperic (open source) did a upgrade path from Nagios so
>> you can include all existing monitoring there, but then you can also do
>> easier management of managed resources, auto-discovery, and a host of
>> graphing, alerting and cross-language/platform features that typically take
>> a bit of work in Nagios to deploy. Check it out:
>>
>> Product page: http://www.hyperic.com/products/nagios-monitoring
>> Plugin details: http://support.hyperic.com/display/hyperforge/Nagios
>> Hyperic Nagios
>> documentation: http://support.hyperic.com/display/DOCS46/Nagios



--
John E. Vincent
http://about.me/lusis

Eric Shamow

Jun 5, 2012, 9:59:17 AM
to devops-t...@googlegroups.com
In general a terrific response.  A few replies inline -

On Tuesday, June 5, 2012 at 12:19 AM, John Vincent wrote:

> While I agree with your general theme, I disagree with the
> high-barrier on the check side. Discouraging checks means discouraging
> monitoring means discouraging knowing when shit blows up. Check
> accessibility and trending accessibility are orthogonal.

I have made it a point in past environments to distinguish very clearly between "monitoring" and "alerting."

I'll monitor any metric you like and put it on a dashboard for you.  But I don't want to be alerted unless it's something that is actionable, and I don't want an escalation chain attached to it unless, at some point, somebody should be woken up if it doesn't get resolved.

Separating those two has kept a lot of the frivolous "well, we should monitor xyz metric because you never know…" stuff from keeping my admins up all night.
 

-Eric

-- 

Eric Shamow
Professional Services

Bryan Berry

Jun 5, 2012, 10:06:35 AM
to devops-t...@googlegroups.com
big +1 for Sean's monitoring talk, I saw it in person and watched it online, learning a lot both times

http://www.youtube.com/watch?v=BXxtdE-Paco 

Vladimir Vuksan

Jun 5, 2012, 10:28:49 AM
to devops-t...@googlegroups.com
On Tue, 5 Jun 2012, John Vincent wrote:

> While I agree with your general theme, I disagree with the
> high-barrier on the check side. Discouraging checks means discouraging
> monitoring means discouraging knowing when shit blows up. Check
> accessibility and trending accessibility are orthogonal.


Not necessarily. It just sets a proper bar for alerting cause people will
set up non-actionable alerts that fire all the time desensitizing people.

(.snip.)

> Imagine an IRC channel with all your servers and your monitoring host
> with channel ops. Many folks are doing something like this with
> Zookeeper as well.
>
> We need to start leveraging discoveries and ideas that have come up
> since Netsaint was written. Things like service discovery (DNS TXT/SRV
> records anyone?).
>
> Apologies if this sounded like a directed attack at you. It most
> assuredly wasn't. Monitoring in general is something that's been
> bugging me for a while.


I don't necessarily disagree with these points, but in my opinion the
monitoringsucks stuff is misplaced, as it treats the problem with monitoring
as a tools problem and not a people problem. This is not to say that we can't
use better tools, but in most cases the issues I observe are:

- Too many alerts / non-actionable alerts
- Alert thresholds chosen inadequately, i.e. alert when CPU > 50% yet
machine's normal operating range is 40-60%
- Reliance on reactive monitoring instead of proactive, i.e. lots of
people don't care or want to bother watching the trending graphs and only
want to be alerted "when something is wrong"
- "Mindless" metrics watching, i.e. take standard disk, load, swap checks
and don't alert beyond that because identifying additional checks is "too
hard"
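
On the threshold point, one way to avoid a limit that sits inside the normal operating range is to derive it from the metric's own history. A sketch under stated assumptions: the 95th percentile and the 10% margin are arbitrary starting values, and the sample data is made up.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def suggest_threshold(history, pct=95, margin=1.10):
    """Suggest an alert threshold just above the metric's normal range."""
    return percentile(history, pct) * margin


def fraction_over(history, threshold):
    """Share of historical samples that would have fired at this threshold."""
    return sum(1 for v in history if v > threshold) / len(history)


# CPU % for a machine whose normal operating range is 40-60 (made-up data).
cpu = [40, 45, 50, 55, 60, 42, 48, 52, 58, 44]
```

With this history a static 50% threshold would have fired on 40% of the samples, while the suggested threshold sits above the whole normal range.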

People like to take the easy way out and monitoring is certainly not the
most exciting thing. We need to make sure that people care enough about
monitoring and things that are monitored. No tool can implement that.

Vladimir

John Vincent

Jun 5, 2012, 10:54:48 AM
to devops-t...@googlegroups.com
On Tue, Jun 5, 2012 at 10:28 AM, Vladimir Vuksan <vli...@veus.hr> wrote:
> On Tue, 5 Jun 2012, John Vincent wrote:
>
>> While I agree with your general theme, I disagree with the
>> high-barrier on the check side. Discouraging checks means discouraging
>> monitoring means discouraging knowing when shit blows up. Check
>> accessibility and trending accessibility are orthogonal.
>
>
>
> Not necessarily. It just sets a proper bar for alerting cause people will
> set up non-actionable alerts that fire all the time desensitizing people.
>
> (.snip.)
>

Fair enough, but it does play into the section below.

>
>> Imagine an IRC channel with all your servers and your monitoring host
>> with channel ops. Many folks are doing something like this with
>> Zookeeper as well.
>>
>> We need to start leveraging discoveries and ideas that have come up
>> since Netsaint was written. Things like service discovery (DNS TXT/SRV
>> records anyone?).
>>
>> Apologies if this sounded like a directed attack at you. It most
>> assuredly wasn't. Monitoring in general is something that's been
>> bugging me for a while.
>
>
>
> I don't necessarily disagree with these points but in my opinion
> monitoringsucks stuff is misplaced as it treats problem with monitoring as a
> tools problem and not a people problem. This is not to say that we can't use
> better tools but in most cases issues I observe are:
>
>  - Too many alerts / non-actionable alerts

Yes and no. Non-actionable alerts need to DIAF. Too many alerts is
relative. I'll address that below.

>  - Alert thresholds chosen inadequately ie. alert when CPU > 50% yet
> machine's normal operating range is 40-60%

+9000.

>  - Reliance on reactive monitoring instead of proactive ie. lots of people
> don't care of want to bother watching the trending graphs and only want to
> be alerted "when something is wrong"

Soft skill (sort of)

>  - "Mindless" metrics watching ie. take standard disk, load, swap checks and
> don't alert beyond that because identifying additional checks is "too hard"
>

Now we're getting somewhere.

> People like to take the easy way out and monitoring is certainly not the
> most exciting thing. We need to make sure that people care enough about
> monitoring and things that are monitored. No tool can implement that.
>

So we agree that monitoring is both a cultural and technical issue. My
problem with most of the current tools is that they're either too
limited in what they can do or too difficult to maintain.

Maybe the issue that we need to tackle is that some aspect of defining
checks needs to be self-service. That lowers the barrier to entry. How
this plays into a fully CM-managed environment, I don't know yet.
Something like graphite-tattle
(https://github.com/wayfair/Graphite-Tattle) feels like it's headed
down the right path in that regard.


And again, there needs to be a culture that says "hey gang, really
it's okay if we disable this check" or "it's really okay not to notify
on this". The tools need to allow the end-user to map these ideals
efficiently and easily.

> Vladimir

JaimeGago

Jun 5, 2012, 11:02:20 AM
to devops-t...@googlegroups.com
Vladimir, not trying to start a flame war here, but I am not so sure about your statement. Relative to monitoring, alerting and the human action linked to it, that is a no-brainer.
But IRT monitoring, if we push the "Infrastructure as Code" concept, then monitoring could be considered an input to a self-healing mechanism with no human interaction. Granted, we now enter the realm of AI (the Google car immediately comes to mind) and leave the realm of "tools", and yes, it's not for tomorrow, but it's not for the next century either...

In the end if Monitoring can be categorized as part of SysAdmin tasks then I'd like to quote Steve Traugott in his "Why Order Matters: Turing Equivalence in Automated Systems Administration"

"One interesting result of automated systems administration efforts might be that, like the term 'computer', the term 'system administrator' may someday evolve to mean a piece of technology rather than a chained human."


J.

Gildas

Jun 5, 2012, 11:25:20 AM6/5/12
to devops-t...@googlegroups.com
On Tue, Jun 5, 2012 at 4:54 PM, John Vincent <lusi...@gmail.com> wrote:
> On Tue, Jun 5, 2012 at 10:28 AM, Vladimir Vuksan <vli...@veus.hr> wrote:
>> On Tue, 5 Jun 2012, John Vincent wrote:
[snip]
>>  - Too many alerts / non-actionable alerts
>
> Yes and no. Non-actionable alerts need to DIAF. Too many alerts is
> relative. I'll address that below.
[snip]
>>  - Reliance on reactive monitoring instead of proactive ie. lots of people
>> don't care or want to bother watching the trending graphs and only want to
>> be alerted "when something is wrong"
>
> Soft skill (sort of)
>
>>  - "Mindless" metrics watching ie. take standard disk, load, swap checks and
>> don't alert beyond that because identifying additional checks is "too hard"

Even if we only consider this, I think what's plain wrong nowadays is
the idea of thresholds: alerting when we're over a threshold can only
lead to reactive behaviour.

To be proactive you would need to be able to automatically detect:
- trends (the "slope" of the graph is very steep and the filesystem
will fill up in x minutes)
- changes in behaviour (the cpu usage is twice the usual cpu usage
for that machine on a Tuesday @ 5pm; app server 23 memory usage is
higher than the other 22 app servers in the cluster)

I'd love to hear about tools that already do that!
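
A minimal sketch of the two detections listed above, not tied to any particular tool. The function names, the sample format and the 1.5x-median rule are assumptions made for illustration; the trend is a plain least-squares slope and the behaviour check only compares a host against its peers.

```python
from statistics import median


def minutes_until_full(samples, capacity):
    """Estimate minutes until `capacity` is reached.

    `samples` is a list of (minute, used) pairs, oldest first. The growth
    rate is the least-squares slope over all samples; the estimate starts
    from the last observed value. Returns None when usage is flat or
    shrinking, or when there is too little data to fit a slope.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    _, last_used = samples[-1]
    return max(0.0, (capacity - last_used) / slope)


def peer_outliers(usage_by_host, factor=1.5):
    """Return hosts whose value exceeds `factor` times the cluster median."""
    if not usage_by_host:
        return []
    mid = median(usage_by_host.values())
    return sorted(h for h, v in usage_by_host.items() if v > factor * mid)
```

An alert on `minutes_until_full(...) < 30` fires on a fast-filling disk at 40% and stays quiet on a slow one at 75%. The "usual for a Tuesday @ 5pm" case would need a stored baseline per weekday and hour, which this sketch leaves out.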

Also, I firmly believe that 4th generation CM software will need to
make dependencies explicit and that monitoring should be "integrated".

If you define something like (I'm oversimplifying): apache on host A
-depend-> mysql on host B,
then I believe it should be your CM agent on host A that reports
the issue and tries to fix it (install mysql on B, correct configuration,
restart, whatever it takes).
Then you would be able to configure reporting/alerting only if
automatic correction fails.
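
As a rough illustration of "alert only if automatic correction fails", here is a small sketch. `Dependency`, `converge` and the `check`/`fix`/`alert` callables are hypothetical names, not part of any existing CM tool; a real agent would plug in its own checks and fixes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Dependency:
    name: str                   # e.g. "mysql on host B"
    check: Callable[[], bool]   # returns True when the dependency is healthy
    fix: Callable[[], None]     # attempts a correction (install, restart, ...)


def converge(dependencies: List[Dependency],
             alert: Callable[[str], None],
             max_attempts: int = 2) -> List[str]:
    """Check each dependency, try to fix it, and alert only if fixing fails.

    Returns the names of the dependencies that are still failing.
    """
    failed = []
    for dep in dependencies:
        attempts = 0
        while not dep.check() and attempts < max_attempts:
            dep.fix()
            attempts += 1
        if not dep.check():
            alert(f"{dep.name}: still failing after {attempts} fix attempt(s)")
            failed.append(dep.name)
    return failed
```

The attempt cap matters: without it a fix that never works would loop forever instead of escalating to a human.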

Of course not everything can derive from CM dependencies and there
might be a need for "external checks" for stuff like client-exposed
services, but I think most of the use cases would be clear from the
dependencies. (on a side note, firewall rules could be derived from
the same sort of approach: open ports depending on the
relationship/dependency graph).

I would very much like to propose an open space session on this
subject during Devopsdays MV 2012 :)

Cheers,
Gildas

Vladimir Vuksan

Jun 5, 2012, 11:31:19 AM6/5/12
to devops-t...@googlegroups.com
Ideally an alert never fires since you have

- fixed all the issues
- built auto-remediation systems that take care of issues

We are just very far from it.

Vladimir

ranjib dey

Jun 5, 2012, 11:32:32 AM6/5/12
to devops-t...@googlegroups.com
+1,
As you develop more elastic, feedback-driven infrastructure, you realize raw threshold values are really cumbersome. What I am opting for nowadays is to have a secondary metric which is derived from the primary metric. Slopes in the trends of the secondary metrics are what trigger alerts.
More concretely: you can feed raw metrics into Nagios via NSCA/NRPE, push them into Graphite, and then have secondary checks in Nagios which read Graphite; functions like moving average etc. are readily available.
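
A sketch of such a secondary check, assuming Graphite's render endpoint with `format=json` (a list of series, each with `datapoints` as `[value, timestamp]` pairs) and the usual Nagios plugin exit codes. The base URL, metric name, function names and the warn/crit values are placeholders, and the slope is simply taken between the first and last non-null points.

```python
import json
from urllib.parse import urlencode

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def render_url(base_url, metric, window=10, since="-1h"):
    """Build a render URL asking Graphite for a moving average as JSON."""
    query = urlencode({
        "target": f"movingAverage({metric},{window})",
        "from": since,
        "format": "json",
    })
    return f"{base_url}/render?{query}"


def slope_per_minute(datapoints):
    """Slope between the first and last non-null [value, timestamp] pairs."""
    points = [(ts, value) for value, ts in datapoints if value is not None]
    if len(points) < 2:
        return None
    (t0, v0), (t1, v1) = points[0], points[-1]
    if t1 == t0:
        return None
    return (v1 - v0) / ((t1 - t0) / 60.0)


def check_trend(render_json, warn, crit):
    """Turn a Graphite JSON response into a (status, message) pair."""
    series = json.loads(render_json)
    if not series:
        return UNKNOWN, "no series returned"
    slope = slope_per_minute(series[0]["datapoints"])
    if slope is None:
        return UNKNOWN, "not enough datapoints"
    if slope >= crit:
        return CRITICAL, f"slope {slope:.2f}/min"
    if slope >= warn:
        return WARNING, f"slope {slope:.2f}/min"
    return OK, f"slope {slope:.2f}/min"
```

A plugin wrapper would fetch `render_url(...)`, pass the body to `check_trend`, print the message and exit with the status code.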

Vladimir Vuksan

Jun 5, 2012, 11:54:37 AM6/5/12
to devops-t...@googlegroups.com


On Tue, 5 Jun 2012, Gildas wrote:

> Even if we only consider this, I think what's plain wrong nowadays is
> the idea of threshold as alerting when we're over a threshold can only
> lead to reactive behaviour.
>
> To be proactive you would need to be able to automatically detect:
> - trends (the "slope" of the graph is very steep and the filesystem
> will fill up in x minutes)


This is still a threshold :-) ie. how many minutes before x is reached do
you alert. You are just combining metrics.

Trending is no panacea either.



Vladimir

Gildas

Jun 5, 2012, 12:01:58 PM6/5/12
to devops-t...@googlegroups.com
Sure. But I still find it more valuable to be proactive based on such
a threshold (the trend) than being reactive when say a filesystem is
over 70% but filling real slowly, forcing you to change the threshold.

And although it's clearly not a panacea, and I don't think it should
text an admin at 3am if that's one machine in a cluster of 20 and
there are plenty of failsafes, I still believe that could lead to fewer
false positives (and thus more valuable data to base decisions on).
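
The "one machine in a cluster of 20" point can be sketched as a routing rule: page only when a meaningful share of the cluster is unhealthy, otherwise leave a ticket for the morning. `route_alert` and the 25% default are illustrative assumptions, not taken from any tool.

```python
def route_alert(unhealthy, cluster_size, page_fraction=0.25):
    """Decide how loudly to notify based on how much of the cluster is affected.

    Returns "none", "ticket" (handle during working hours) or "page".
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    if unhealthy <= 0:
        return "none"
    if unhealthy / cluster_size >= page_fraction:
        return "page"
    return "ticket"
```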

Gildas

ranjib dey

Jun 5, 2012, 12:07:19 PM6/5/12
to devops-t...@googlegroups.com
Ever wondered why w shows three load averages? There must be some reason, right? When you should use all three and when you should use only one (and which one?) is up to you.
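
One way to use all three averages, as a sketch: compare the 1-minute value with the 15-minute value to get a direction, and compare each with the CPU count to tell a spike from sustained pressure. `describe_load`, the labels and the 10% tolerance are assumptions for illustration.

```python
def describe_load(load1, load5, load15, cpus, tolerance=0.1):
    """Classify the 1/5/15-minute load averages.

    Returns (direction, pressure): direction compares the 1-minute average
    with the 15-minute one, pressure compares each average with `cpus`.
    """
    if load1 > load15 * (1 + tolerance):
        direction = "rising"
    elif load1 < load15 * (1 - tolerance):
        direction = "falling"
    else:
        direction = "steady"

    if load15 > cpus:
        pressure = "sustained overload"
    elif load5 > cpus:
        pressure = "building"
    elif load1 > cpus:
        pressure = "short spike"
    else:
        pressure = "normal"
    return direction, pressure
```

On a Unix host the inputs can come from `os.getloadavg()` and `os.cpu_count()`, for example `describe_load(*os.getloadavg(), cpus=os.cpu_count() or 1)`.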

Peco Karayanev

Jun 6, 2012, 6:26:07 PM6/6/12
to devops-t...@googlegroups.com
The auto-healing/proactive approach works (and I have used it in the past) in simple scenarios for well-known failure points. Unfortunately, with the complexity of existing enterprise/cloud environments where you have n interacting systems and apps, it is fairly difficult to troubleshoot and diagnose problems, let alone be proactive about solving them. In complex systems most problems end up being unique, and you can't really automate unpredictable scenarios. The Application Performance Management space/industry has made huge strides (cross-tier transaction tracing, automated metric correlation, real-time sampling, event analysis), but there is still quite a bit more to be done before we can achieve auto-healing.
The same "monitoring" data can be used for other purposes, not just auto-healing: scaling nodes based on workload, for example, or auto-tuning a cache based on how the system is used, or shifting geographical presence based on where data is being accessed the most. Enter "Analytics in the loop": analytics about the system become part of the system itself as a cross-cutting concern. Eventually, I think we will have adaptive systems (auto-healing included), but we need to solve the more "mundane" problems first: have a fuller spectrum of high-resolution system, network and application data and the analytical tools to make sense of it at scale.

modi....@gmail.com

Jun 6, 2012, 10:26:24 PM6/6/12
to devops-t...@googlegroups.com
This really is a handy approach. What I'm currently working on is a scenario where the thresholds change depending on the day of the week and the time of day.

What I'm trying to write is a generic framework on top of an existing alerting system like Nagios to help collect the thresholds from users and push them to the alerting system accordingly!
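
A sketch of the lookup side of such a framework, assuming thresholds are collected as simple schedule rules. The rule format and `threshold_for` are hypothetical; pushing the result into Nagios is left out.

```python
from datetime import datetime


def threshold_for(schedule, default, when):
    """Return the threshold that applies at `when`.

    `schedule` is a list of (weekdays, start_hour, end_hour, threshold)
    rules, where `weekdays` is a set of ints with Monday = 0 (as returned
    by datetime.weekday()) and the hour range is [start_hour, end_hour).
    The first matching rule wins; `default` is used when none match.
    """
    for weekdays, start_hour, end_hour, threshold in schedule:
        if when.weekday() in weekdays and start_hour <= when.hour < end_hour:
            return threshold
    return default


# Example rules: higher CPU threshold during weekday business hours,
# lower one at weekends.
EXAMPLE_SCHEDULE = [
    ({0, 1, 2, 3, 4}, 9, 18, 80.0),
    ({5, 6}, 0, 24, 40.0),
]
```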

Bikash B

Jul 5, 2012, 7:46:15 AM7/5/12
to devops-t...@googlegroups.com
Can you provide me with a working example? I am kinda lost here.

I need Nagios to check for some event on one server and, based on that event, I need to fire up an EBS-backed EC2 instance.


B

Devdas Bhagat

Jul 5, 2012, 10:06:10 AM7/5/12
to devops-t...@googlegroups.com
On Thu, Jul 5, 2012 at 1:46 PM, Bikash B <bikash....@gmail.com> wrote:
> Can you provide me with working example ? I am kinda lost here.
>
> I need nagios to check for some event in one server and based on that event
> i need to fireup an EC2 EBS instance
>
Look up Nagios eventhandlers.

http://nagios.sourceforge.net/docs/3_0/eventhandlers.html
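
A minimal sketch of what such a handler could look like, assuming it is called with the `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros as in the event handler documentation. The command name and script path in the comment are made up, and `launch_instance` is a placeholder for whatever actually starts the instance.

```python
# Hypothetical Nagios command definition that would call this script:
#
#   define command {
#       command_name  launch-extra-instance
#       command_line  /usr/local/bin/launch_handler.py $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
#   }


def should_launch(state, state_type):
    """Act only on a confirmed (HARD) CRITICAL state, not on soft retries."""
    return state == "CRITICAL" and state_type == "HARD"


def handle_event(args, launch_instance):
    """Process one event handler invocation.

    `args` is [state, state_type, attempt] as passed on the command line;
    `launch_instance` is a callable supplied by the caller. Returns True
    if the launch was triggered.
    """
    if len(args) < 3:
        return False
    state, state_type = args[0], args[1]
    if should_launch(state, state_type):
        launch_instance()
        return True
    return False
```

A script entry point would call `handle_event(sys.argv[1:], launch_instance)` with a real launch function. Ignoring SOFT states avoids launching instances while Nagios is still retrying the check.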

Devdas Bhagat

Ranjib Dey

Jul 5, 2012, 1:06:34 PM7/5/12
to devops-t...@googlegroups.com
You mean an EC2 instance backed by EBS.

Yeah.
Adding to what's already been mentioned, you can use the knife ec2 plugin, raw fog (if you are in Ruby), boto (Python) or the AWS command line tools (if you like shell) to spawn instances etc.
Also, if you are on a 64-bit arch, you can resize existing instances.
What is difficult is finding the metric on which to trigger auto scaling, both up and down. I am experimenting with secondary metrics, derived from Graphite, built from primary metrics (in Nagios). Would love to hear what others are doing.
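
Whatever metric ends up driving it, the up/down decision usually needs a gap between the two thresholds plus a cooldown so the cluster does not flap. A sketch with hypothetical names and defaults; the metric itself (for example a moving average from Graphite) is supplied by the caller.

```python
def scaling_decision(metric, current_nodes, *, scale_up_at, scale_down_at,
                     min_nodes, max_nodes, seconds_since_last_change,
                     cooldown=300):
    """Return the desired node count, changing by at most one node.

    `scale_up_at` should be well above `scale_down_at` (hysteresis), and no
    change is made until `cooldown` seconds have passed since the last one.
    """
    if seconds_since_last_change < cooldown:
        return current_nodes
    if metric >= scale_up_at and current_nodes < max_nodes:
        return current_nodes + 1
    if metric <= scale_down_at and current_nodes > min_nodes:
        return current_nodes - 1
    return current_nodes
```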

regards
ranjib

Willem

Oct 11, 2013, 7:02:33 AM10/11/13
to devops-t...@googlegroups.com
Hi! It's been a year since, and I got a lot of inspiration from this thread. But I wonder what new architectures or ideas people came up with in the past 12 months. 

Let me start with my own humble observations. Architecture wise, I see a trend towards decentralization: local service discovery, push results and device-initiated registration. This architecture is especially valuable in an environment with cloud instances that are spawned from a CMDB. Example: Sensu does a great job with its subscription model. 
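
To make the device-initiated registration idea concrete, here is a toy sketch of a subscription registry. It is only loosely modelled on the subscription idea and is not Sensu's implementation or API; the class, method and host names are invented for the example.

```python
from collections import defaultdict


class SubscriptionRegistry:
    """Clients register themselves; checks are addressed to subscriptions."""

    def __init__(self):
        self._subscribers = defaultdict(set)

    def register(self, client, subscriptions):
        """Called by a freshly booted instance to announce its roles."""
        for name in subscriptions:
            self._subscribers[name].add(client)

    def deregister(self, client):
        """Called when an instance is terminated."""
        for members in self._subscribers.values():
            members.discard(client)

    def targets(self, check_subscriptions):
        """Return the clients that should run a check, without duplicates."""
        hosts = set()
        for name in check_subscriptions:
            hosts |= self._subscribers.get(name, set())
        return sorted(hosts)
```

The monitoring server never needs a list of hosts: an autoscaled instance that registers as `["webserver"]` starts receiving every check addressed to that subscription.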

Another trend is outsourcing of monitoring using SaaS providers. Serverdensity, Wormly, Bijk, Scoutapp all provide a drop-in Nagios replacement with a fancy dashboard and Pagerduty integration for 2.5 - 5 USD/device/month. Curiously, while their marginal costs should be negligible, they don't cater for large volumes. I think their price model dictates a tipping point of about 200 servers, beyond that it seems more economical to deploy and maintain your own monitoring system. 
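
The per-device arithmetic behind that tipping point, as a small sketch. The 2.5 and 5 USD figures are the ones quoted above; volume discounts and the cost of running your own system are not modelled, and the function name is just for illustration.

```python
def monthly_saas_cost(devices, low=2.5, high=5.0):
    """Return the (low, high) monthly cost in USD at a flat per-device price."""
    return devices * low, devices * high
```

At 200 devices this gives 500 to 1000 USD per month, and at 1200 devices 3000 to 6000 USD per month.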

Another trend is merging of trend and incident data. In the past, people were running separate Nagios and Munin systems. SaaS providers have adopted a merged model, with a focus on incidents (Serverdensity) or trends (Librato Metrics) but catering for both. Some people reported integrating Graphite (trends) and Sensu (incidents). 

For our own operations (1200 devices, currently with Nagios and Graphite) we haven't decided yet. Preferably, I would outsource everything, but that is unwarranted as long as it would triple my monitoring costs. Second best, I like the Sensu-Graphite combination from an architecture perspective, but I am a bit worried about the limited maturity of Sensu and its apparently declining developer interest (https://github.com/sensu/sensu/graphs/commit-activity). Third, I like the Librato (Graphite-like) hosted service with Pagerduty integration; however, they don't seem to focus on (large) server installs, as there is AFAIK no way to group servers into roles and apply notification templates.

Cheers
Willem

Some references




Walid

Oct 13, 2013, 3:41:14 AM10/13/13
to devops-t...@googlegroups.com
The few-dollars-per-month pricing works fine for small infrastructures; however, as you said, there should be a different metric than node and month for larger infrastructures. A combination of monitoring (trends and failures), logging, and configuration management automation would be a good complementary set of solutions to cover most of any site's requirements.



Douglas Land

Oct 17, 2013, 2:52:49 AM10/17/13
to devops-t...@googlegroups.com
While not intended to be a centralized monitoring solution, I wanted to throw out something I've been working on lately:


It's intended to be a per-server service (the rationale is inside the readme), and I'm also using it as a status system to do things like manage load-balancer pools. I'm still concepting it out, so any feedback is appreciated.

Adrian Cole

May 30, 2014, 1:52:03 AM5/30/14
to devops-t...@googlegroups.com
One aspect touched upon a couple of times in this thread is auto-remediation/self-healing. Recently, it seems there's no shortage of chatter about BOSH and its wolverine-like powers. What are folks using or looking towards for host-level auto-remediation these days?

-A

Jaime

May 30, 2014, 8:34:44 PM5/30/14
to devops-t...@googlegroups.com
I've recently implemented a Sensu/Graphite/Mcollective pipeline with some basic remediation/self healing, so far so good.

The architecture is a copy paste from these 2 posts by the same _brilliant_ guy:


I like the modular approach of this setup. Granted, it has a lot of moving parts, but its capabilities are indeed Wolverine + Professor Xavier like.
Imagine that you're good enough with stats and probabilities, and know your system well enough, that you know when service XYZ is going to break based on, say, "the value of the sum of the derivatives of metrics foo and bar compared to the value of metric baz is increasing at a rate of N/min". Graphite provides you with the data *and* the maths functions to apply to the metrics on the fly, then Sensu does the healing "routing" (i.e. checking Graphite data using functions and talking to MCollective for the actual remediation call). So you see, _in theory_ not only can this monitoring setup provide external healing to what it's looking at, it can also take mathematically calculated preventive measures!

There is a major caveat here: an implementation mistake could end up restarting the entire XYZ cluster every 10 seconds, so having safeguards everywhere they're possible is not optional.
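
One such safeguard, sketched as a sliding-window rate limit placed in front of the remediation call. `RemediationGuard` and its parameters are hypothetical and not part of Sensu or MCollective; the handler would ask the guard before acting and escalate to a human when refused.

```python
import time
from collections import deque


class RemediationGuard:
    """Allow at most `max_actions` remediations per `window_seconds`."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self._clock = clock        # injectable so the guard can be tested
        self._history = deque()    # timestamps of recent allowed actions

    def allow(self):
        """Return True and record the action if the budget permits it."""
        now = self._clock()
        # Drop actions that have fallen out of the window.
        while self._history and now - self._history[0] >= self.window_seconds:
            self._history.popleft()
        if len(self._history) >= self.max_actions:
            return False
        self._history.append(now)
        return True
```

With `RemediationGuard(max_actions=2, window_seconds=600)`, a buggy check that fires every 10 seconds gets two restarts and is then refused until the window has passed.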

J.