Monitoring cron using Pushgateway or Statsd-exporter (TTL)

hash...@gmail.com

Sep 6, 2016, 10:10:12 AM

to Prometheus Developers

Hello,

To monitor a bunch, perhaps all, of our cron jobs I've been experimenting using the pushgateway and later the statsd-exporter. Both "work", but not perfectly.

First I tried using the pushgateway as documented on prometheus.io. Almost immediately I could post how long my job took, when it last finished, etc. This left me wishing for a job count, since that would effectively give me an indication of the last time it ran (instead of storing a timestamp).
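
As a rough illustration of the kind of push described above, here is a minimal Python sketch using only the standard library. The metric names, the job name `nightly_backup`, and the gateway address `http://localhost:9091` are assumptions made for the example; the `/metrics/job/<job>` path and the text exposition format are the parts taken from the Pushgateway's HTTP API.

```python
import urllib.parse
import urllib.request


def build_payload(duration_seconds: float, finished_at: float) -> bytes:
    """Render two gauges in the Prometheus text exposition format."""
    lines = [
        # Metric names below are hypothetical, chosen for this example.
        "# TYPE cron_job_duration_seconds gauge",
        f"cron_job_duration_seconds {duration_seconds}",
        "# TYPE cron_job_last_finished_timestamp_seconds gauge",
        f"cron_job_last_finished_timestamp_seconds {finished_at}",
    ]
    # The body has to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def build_push_request(gateway: str, job: str, payload: bytes) -> urllib.request.Request:
    """Build a PUT, which replaces all metrics previously pushed for this job."""
    job_path = urllib.parse.quote(job, safe="")
    url = f"{gateway.rstrip('/')}/metrics/job/{job_path}"
    return urllib.request.Request(url, data=payload, method="PUT")


def push(request: urllib.request.Request) -> int:
    """Send the request; needs a reachable Pushgateway. Returns the HTTP status."""
    with urllib.request.urlopen(request) as response:
        return response.status
```

A cron wrapper would record `time.time()` before and after the job, then call `push(build_push_request("http://localhost:9091", "nightly_backup", build_payload(end - start, end)))`.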

So I searched some more for a tool that could increment a job count, and increment the total run time. Welcome statsd-exporter to the stage. It can "remember" the last jobcount. This actually works for most of my use cases of monitoring cron jobs.

Now, my remaining pain is from jobs that disappeared, especially those that disappeared after a failure. (They never leave the failed state.)

I think a TTL would solve that problem. Now Riemann has TTLs, but meh. This left me wishing that either statsd-exporter would support a TTL, or the pushgateway would support both a TTL and an incremental update.

Is it worth exploring changes to the current tools to support a TTL, or has someone else solved this using a different tool?

Kai

Brian Brazil

Sep 6, 2016, 12:24:21 PM

to hash...@gmail.com, Prometheus Developers

On 6 Sep 2016 15:10, <hash...@gmail.com> wrote:
>
> Hello,
>
> To monitor a bunch, perhaps all, of our cron jobs I've been experimenting using the pushgateway and later the statsd-exporter. Both "work", but not perfectly.
>
> First I tried using the pushgateway as documented on prometheus.io. Almost immediately I could post how long my job took, when it last finished, etc. This left me wishing for a job count, since that would effectively give me an indication of the last time it ran (instead of storing a timestamp).

Why do you not want to store a timestamp? That's the standard way of doing this.

> So I searched some more for a tool that could increment a job count, and increment the total run time. Welcome statsd-exporter to the stage. It can "remember" the last jobcount. This actually works for most of my use cases of monitoring cron jobs.
>
> Now, my remaining pain is from jobs that disappeared, especially those that disappeared after a failure. (They never leave the failed state.)
>
> I think a TTL would solve that problem. Now Riemann has TTLs, but meh. This left me wishing that either statsd-exporter would support a TTL, or the pushgateway would support both a TTL and an incremental update.
>
> Is it worth exploring changes to the current tools to support a TTL, or has someone else solved this using a different tool?

If your job has failed and there has yet to be a success since then, then it is correct to alert.

This is more a service turndown/management problem than a monitoring problem.
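
One way to read "service turndown" here is that retiring a cron job should include deleting its pushed metrics, so a stale failure stops alerting. A minimal Python sketch of that step follows, under the assumption that the job pushed with only a job name as its grouping key. The gateway address and job name are hypothetical; the `DELETE` on `/metrics/job/<job>` is part of the Pushgateway's HTTP API.

```python
import urllib.parse
import urllib.request


def build_delete_request(gateway: str, job: str) -> urllib.request.Request:
    """Build the request that removes the metric group pushed under `job`."""
    job_path = urllib.parse.quote(job, safe="")
    url = f"{gateway.rstrip('/')}/metrics/job/{job_path}"
    return urllib.request.Request(url, method="DELETE")


def retire_job(gateway: str, job: str) -> int:
    """Send the delete; needs a reachable Pushgateway. Returns the HTTP status."""
    with urllib.request.urlopen(build_delete_request(gateway, job)) as response:
        return response.status
```

Calling `retire_job("http://localhost:9091", "nightly_backup")` from the same change that removes the crontab entry keeps the cleanup tied to the turndown rather than to a timer.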

Brian

>
> Kai
>
> --
> You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-devel...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

hash...@gmail.com

Sep 7, 2016, 4:47:47 AM

to Prometheus Developers, hash...@gmail.com

Hi Brian,

On Tuesday, September 6, 2016 at 6:24:21 PM UTC+2, Brian Brazil wrote:
> On 6 Sep 2016 15:10, <hash...@gmail.com> wrote:
> > To monitor a bunch, perhaps all, of our cron jobs I've been experimenting using the pushgateway and later the statsd-exporter. Both "work", but not perfectly.
> >
> > First I tried using the pushgateway as documented on prometheus.io. Almost immediately I could post how long my job took, when it last finished, etc. This left me wishing for a job count, since that would effectively give me an indication of the last time it ran (instead of storing a timestamp).
>
> Why do you not want to store a timestamp? That's the standard way of doing this.

Because I find it very unexpressive, but meh.

The hardest part is seeing how often my job ran if I don't have an increment. If a job always takes exactly 5 seconds and succeeds, I won't see the intermediate runs in Prometheus (since the value didn't change). Oh, I could do very intelligent things with the timestamp if I stored it, but it makes for a horrible way to talk about something compared to a plain counter.

I very much see a use case for an increment-like update to the pushgateway for this. I'm okay if you disagree.

> > So I searched some more for a tool that could increment a job count, and increment the total run time. Welcome statsd-exporter to the stage. It can "remember" the last jobcount. This actually works for most of my use cases of monitoring cron jobs.
> >
> > Now, my remaining pain is from jobs that disappeared, especially those that disappeared after a failure. (They never leave the failed state.)
> >
> > I think a TTL would solve that problem. Now Riemann has TTLs, but meh. This left me wishing that either statsd-exporter would support a TTL, or the pushgateway would support both a TTL and an incremental update.
> >
> > Is it worth exploring changes to the current tools to support a TTL, or has someone else solved this using a different tool?
>
> If your job has failed and there has yet to be a success since then, then it is correct to alert.
>
> This is more a service turndown/management problem than a monitoring problem.

You're right, it is.

Thanks!

Kai

Brian Brazil

Sep 7, 2016, 4:58:10 AM

to hash...@gmail.com, Prometheus Developers

On 7 September 2016 at 09:47, <hash...@gmail.com> wrote:

> Hi Brian,
>
> On Tuesday, September 6, 2016 at 6:24:21 PM UTC+2, Brian Brazil wrote:
> > On 6 Sep 2016 15:10, <hash...@gmail.com> wrote:
> > > To monitor a bunch, perhaps all, of our cron jobs I've been experimenting using the pushgateway and later the statsd-exporter. Both "work", but not perfectly.
> > >
> > > First I tried using the pushgateway as documented on prometheus.io. Almost immediately I could post how long my job took, when it last finished, etc. This left me wishing for a job count, since that would effectively give me an indication of the last time it ran (instead of storing a timestamp).
> >
> > Why do you not want to store a timestamp? That's the standard way of doing this.
>
> Because I find it very unexpressive, but meh.
>
> The hardest part is seeing how often my job ran if I don't have an increment. If a job always takes exactly 5 seconds and succeeds, I won't see the intermediate runs in Prometheus (since the value didn't change). Oh, I could do very intelligent things with the timestamp if I stored it, but it makes for a horrible way to talk about something compared to a plain counter.

The changes() function will tell you how often the timestamp changes.
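
To make the point concrete, here is a small Python sketch that mimics what `changes()` counts over a window of scraped samples: the number of times the value differs from the previous sample. The sample values are invented for illustration, and the function only approximates the PromQL behaviour for a single series.

```python
from typing import Sequence


def changes(samples: Sequence[float]) -> int:
    """Count how many times consecutive samples differ."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


# Hypothetical scrapes of one window. The job always takes exactly 5 seconds,
# so the duration gauge never moves, but the last-finished timestamp does.
duration_samples = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]
timestamp_samples = [1000.0, 1000.0, 1300.0, 1300.0, 1600.0, 1900.0]

print(changes(duration_samples))   # 0
print(changes(timestamp_samples))  # 3
```

The count is a lower bound: if the job completes twice between two scrapes, only one change is visible.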

> I very much see a use case for an increment-like update to the pushgateway for this. I'm okay if you disagree.

This is an explicit non-goal of the pushgateway.

Brian

Brian Brazil

www.robustperception.io
