Using Nagios plugins as a data source?


Brian Candler

Jul 27, 2017, 12:40:59 PM
to Prometheus Users
At https://github.com/prometheus/nagios_plugins there is a nagios plugin for checking the data stored in Prometheus.

I am interested in doing it the other way round: being able to run Nagios plugins and store their results as Prometheus time series.

I imagine it would be something similar to NRPE: Prometheus would poll an HTTP endpoint, which would in turn invoke the plugin and return its result(s).  I'd like a time series for the check result (0/1/2 = OK/warning/critical); and if there is structured "performance data"[^1] in the output then I'd like a time series for each data item too.
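Something along these lines is what I have in mind. It's a rough, untested sketch in Python; the plugin command, port and metric names (nagios_check_status, nagios_check_perfdata) are just made-up examples. Each scrape runs the plugin once and reports its exit code plus any perfdata values:

#!/usr/bin/env python3
# Rough sketch (untested): run ONE Nagios plugin per scrape and expose its
# exit code and perfdata in the Prometheus text format.  The plugin command,
# port and metric names below are made-up examples.
import re
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

PLUGIN = ["/usr/lib/nagios/plugins/check_disk", "-w", "20%", "-c", "10%", "-p", "/"]

def run_plugin():
    try:
        proc = subprocess.run(PLUGIN, capture_output=True, text=True, timeout=8)
        status, output = proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        status, output = 3, ""          # treat a timeout as UNKNOWN
    lines = ["nagios_check_status %d" % status]
    # Perfdata sits after the first "|" as space-separated label=value[;...] items.
    if "|" in output:
        for item in output.split("|", 1)[1].split():
            m = re.match(r"'?([^'=]+)'?=([-0-9.]+)", item)
            if m:
                label = re.sub(r"[^a-zA-Z0-9_]", "_", m.group(1))
                lines.append('nagios_check_perfdata{item="%s"} %s' % (label, m.group(2)))
    return "\n".join(lines) + "\n"

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = run_plugin().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 9876), Handler).serve_forever()

Prometheus would then scrape that endpoint like any other target; whether running the plugin inside the scrape is fast enough is the obvious question.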

Has this been done before? I did some googling but didn't find very much. There's this:
That requires an existing check_mk engine which runs the plugins periodically; the results are then scraped from that system for prometheus.

ISTM it would be simpler and more robust if the plugin were directly triggered by prometheus' own polling cycle.  Does anything like this already exist?

Thanks,

Brian.

Julius Volz

Jul 27, 2017, 12:51:35 PM
to Brian Candler, Prometheus Users
I'm not aware of an implementation, but I've had that thought a few times over the years.

One challenge with a check-during-scrape approach is that Nagios plugins can be quite arbitrary command executions taking a long time, whereas Prometheus scrapes should be designed to be really fast. But yeah, you could still build it.


Brian Candler

Jul 27, 2017, 1:56:35 PM
to Prometheus Users, b.ca...@pobox.com
> Nagios plugins can be quite arbitrary command executions taking a long time, whereas Prometheus scrapes should be designed to be really fast.

This is true.

I'm not sure how fast "really fast" needs to be.  For example, presumably when you scrape the snmp_exporter, it will cause multiple SNMP round-trips to the device for snmp[bulk]walk over various parts of the MIB, each of which may take some milliseconds; the whole lot could end up taking tens to hundreds of milliseconds.  But this seems to be OK.

I guess prometheus copes with this latency by making parallel outbound connections.  Maybe therefore it would be better for each scrape to trigger one plugin, rather than having a single scrape which runs a bunch of plugins and returns a whole set of metrics?

For me, the big benefit of nagios plugins is that they are a doddle to code for all sorts of odd situations which need to be monitored, and typically they'd only be run once a minute or so anyway.

Cheers,

Brian.

Tobias Schmidt

Jul 27, 2017, 2:13:01 PM
to Brian Candler, Prometheus Users

You can configure the scrape timeout. It defaults to 10s iirc. At SoundCloud we used to have some slow jmx scrapes which took upwards of 30s and Prometheus didn't care too much. I don't expect this to be an issue. Just make sure to have timeouts in place when you execute the scripts.



Ben Kochie

Jul 27, 2017, 2:26:49 PM
to Brian Candler, Prometheus Users
As long as it reliably completes within the scrape timeout, you're ok.  But of course the timestamp accuracy degrades the longer the scrape takes, depending on how the data is gathered.

A normal Prometheus client library can usually respond in under 1 second, often in under 100ms.

SNMP is also slow, but gives you all the metrics in one round trip.  Some devices can respond in under 1s.  But you're right, it is highly variable.


Brian Candler

Sep 13, 2017, 4:22:05 AM
to Prometheus Users
I looked into this a bit more.

AFAICS, the normal way to "federate" nagios instances is:
- run a remote nagios instance which does its own polling
- submit results as passive checks to a central nagios instance, e.g. by configuring ocsp_command to talk to NRDP


Now, it seems to me that it wouldn't be too hard to build a nagios_nrdp_exporter which could receive the NRDP messages, update local metrics in RAM, and in turn be scraped by prometheus at whatever polling interval is convenient.  Then prometheus doesn't have to do the plugin invocation, and a single scrape gets all the current host/service check statuses.
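To make that concrete, here is the kind of thing I have in mind (a rough, untested sketch: passive results arrive as simple form fields rather than the real NRDP XML payload, which would need proper parsing, and the metric name, labels and port are all made up):

#!/usr/bin/env python3
# Rough sketch (untested) of a "nagios_nrdp_exporter": passive check results
# are POSTed to it, kept in RAM, and exposed on /metrics for prometheus.
# For simplicity this accepts form fields (hostname, service, state) rather
# than the real NRDP XML payload; metric name, labels and port are made up.
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

results = {}        # (hostname, service) -> (state, time last seen)
lock = threading.Lock()
EXPIRY = 600        # drop results not refreshed within 10 minutes

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):      # e.g. hostname=web1&service=disk&state=2
        length = int(self.headers["Content-Length"])
        fields = parse_qs(self.rfile.read(length).decode())
        key = (fields["hostname"][0], fields["service"][0])
        with lock:
            results[key] = (int(fields["state"][0]), time.time())
        self.send_response(200)
        self.end_headers()

    def do_GET(self):       # prometheus scrapes this
        now = time.time()
        with lock:
            lines = ['nagios_service_status{hostname="%s",service="%s"} %d' % (h, s, state)
                     for (h, s), (state, seen) in results.items() if now - seen < EXPIRY]
        body = ("\n".join(lines) + "\n").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("", 9877), Handler).serve_forever()

Because it drops results which haven't been refreshed recently, stale host/service entries simply disappear from the scrape.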

Initially I thought the push gateway would serve this role, but there seem to be two limitations:

1. I would want the timestamp to be the time of the service check, not the time of the scrape.  Then Nagios plugins could run at (say) 1-5 minute intervals and prometheus could scrape at 10 second intervals, and the timestamp would be the time of the check.  Push gateway doesn't work that way:


2. I would want metrics to expire if they have not been updated by a service check for 5-10 minutes.  But pushgateway keeps them forever:


But: the textfile collector of node_exporter looks like it could be a quick way to prototype this.
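For example, a cron job along these lines (untested; the plugin, textfile directory and metric name are placeholders) could write a .prom file for the textfile collector to pick up:

#!/usr/bin/env python3
# Rough sketch (untested), meant to be run from cron: execute one Nagios plugin
# and write its status into node_exporter's textfile collector directory.
# Plugin, directory and metric name are placeholders.
import os
import subprocess

TEXTFILE_DIR = "/var/lib/node_exporter/textfile"    # assumed collector directory
PLUGIN = ["/usr/lib/nagios/plugins/check_load", "-w", "5,4,3", "-c", "10,8,6"]

proc = subprocess.run(PLUGIN, capture_output=True, text=True, timeout=30)
lines = ['nagios_check_status{check="load"} %d' % proc.returncode]
# Perfdata after the "|" could be parsed here too, as in the earlier sketch.

tmp = os.path.join(TEXTFILE_DIR, "check_load.prom.tmp")
with open(tmp, "w") as f:
    f.write("\n".join(lines) + "\n")
os.rename(tmp, os.path.join(TEXTFILE_DIR, "check_load.prom"))

The rename at the end matters: node_exporter may read the directory at any moment, so the file should appear atomically.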

Brian Candler

Sep 14, 2017, 4:35:32 AM
to Prometheus Users
One problem: whilst I could keep the plugin status as a value 0/1/2/3, and any perfdata metrics if they are generated [^1], I would have to throw away the plugin text output.

I have a recent real world example to illustrate this.

Last night, it appears that the OCSP responder for a well-known commercial certificate authority went down.  Since we run check_ssl_certificate for each server, I got a zillion separate nagios alerts containing a critical status for each certificate individually; and a couple of hours later, a zillion resolved mails.

If I had routed this through prometheus' alert manager, these alerts could have all been grouped into a single mail.  Yay!!

However the plugin output contained text information such as:

SSL_CERT CRITICAL www.example.com: Responder Error: unauthorized (6)
SSL_CERT CRITICAL uktest1.example.com: Response Verify Failure

and having that information available in the alert was invaluable in finding the problem.

The normal plugin output is variable, e.g.

SSL_CERT OK - X.509 certificate 'www.example.com' from 'XXX Authority' valid until Jun 29 10:01:02 2019 GMT (expires in 653 days)

so is not suitable as a prometheus label.

So how could I work around this? I was thinking:
- stash the plugin text output somewhere outside of prometheus, e.g. consul
- add this text as a synthesized label when passing alerts onto alertmanager (e.g. <alert_relabel_configs> with __meta_consul_tags)

Another approach might be a proxy between prometheus and alertmanager which adds extra labels from a lookup.
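Sketching that last idea (untested; the /api/v1/alerts path and the alert JSON shape are my assumptions about what prometheus posts to alertmanager, and the in-memory lookup table and port stand in for consul and a real deployment):

#!/usr/bin/env python3
# Rough sketch (untested) of a relay between prometheus and alertmanager: it
# receives the alert push, attaches the stored plugin output as an annotation,
# and forwards the alerts on.  The /api/v1/alerts path and the JSON shape are
# my assumptions about what prometheus posts; the lookup dict stands in for consul.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ALERTMANAGER = "http://alertmanager.example.com:9093/api/v1/alerts"

def plugin_output_for(labels):
    # Stand-in for a consul (or other key/value) lookup keyed on alert labels.
    lookup = {("web1", "ssl_cert"): "Responder Error: unauthorized (6)"}
    return lookup.get((labels.get("instance", ""), labels.get("check", "")), "")

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        alerts = json.loads(self.rfile.read(length))
        for alert in alerts:
            text = plugin_output_for(alert.get("labels", {}))
            if text:
                alert.setdefault("annotations", {})["plugin_output"] = text
        req = urllib.request.Request(ALERTMANAGER,
                                     data=json.dumps(alerts).encode(),
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=10)
        self.send_response(200)
        self.end_headers()

HTTPServer(("", 9094), Handler).serve_forever()

Prometheus would be pointed at this relay instead of at alertmanager itself. Attaching the text as an annotation rather than a label should avoid changing the alert's identity, so grouping in alertmanager would be unaffected.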

Any other ideas gratefully received.  Maybe this is just the wrong tool for the job, but I'm not aware of anything else which has the alert-grouping capability of alertmanager.

Regards,

Brian.

Brian Brazil

Sep 14, 2017, 4:52:15 AM
to Brian Candler, Prometheus Users
On 14 September 2017 at 09:35, Brian Candler <b.ca...@pobox.com> wrote:
> One problem: whilst I could keep the plugin status as a value 0/1/2/3, and any perfdata metrics if they are generated [^1], I would have to throw away the plugin text output.
>
> I have a recent real world example to illustrate this.
>
> Last night, it appears that the OCSP responder for a well-known commercial certificate authority went down.  Since we run check_ssl_certificate for each server, I got a zillion separate nagios alerts containing a critical status for each certificate individually; and a couple of hours later, a zillion resolved mails.
>
> If I had routed this through prometheus' alert manager, these alerts could have all been grouped into a single mail.  Yay!!
>
> However the plugin output contained text information such as:
>
> SSL_CERT CRITICAL www.example.com: Responder Error: unauthorized (6)
> SSL_CERT CRITICAL uktest1.example.com: Response Verify Failure
>
> and having that information available in the alert was invaluable in finding the problem.
>
> The normal plugin output is variable, e.g.
>
> SSL_CERT OK - X.509 certificate 'www.example.com' from 'XXX Authority' valid until Jun 29 10:01:02 2019 GMT (expires in 653 days)
>
> so is not suitable as a prometheus label.
>
> So how could I work around this? I was thinking:
> - stash the plugin text output somewhere outside of prometheus, e.g. consul
> - add this text as a synthesized label when passing alerts onto alertmanager (e.g. <alert_relabel_configs> with __meta_consul_tags)

The general idea is you get your alert, and for this sort of alert you would then jump to logs to see what exactly went wrong. This sort of debug-level information would never pass through Prometheus or the Alertmanager.

Brian
 

> Another approach might be a proxy between prometheus and alertmanager which adds extra labels from a lookup.
>
> Any other ideas gratefully received.  Maybe this is just the wrong tool for the job, but I'm not aware of anything else which has the alert-grouping capability of alertmanager.
>
> Regards,
>
> Brian.
