How do I query for number of times a service is down

awa...@gmail.com

Jul 6, 2017, 12:49:52 PM
to Prometheus Developers
Hi,

I am trying to use the up metric to determine the number of times the service was down for less than a minute (potentially a network hiccup) during a time range (or per hour).

The best I have so far is up == 0, which gives me a series with points only when the service was down, but I am not sure what to do next.

Any help with this type of query would be greatly appreciated.

Thanks.

Ben Kochie

Jul 6, 2017, 1:11:57 PM
to awa...@gmail.com, Prometheus Developers
There are a number of functions that will help with this.

My favorite is simply doing:

avg_over_time(up[1h]), which will give you the uptime over the last hour as a float between 0 and 1 (multiply by 100 for a percentage).
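
As a rough illustration of what avg_over_time(up[1h]) computes, the Python sketch below averages a list of 0/1 samples. The function name uptime_fraction and the sample values are made up for this example; in Prometheus the averaging happens server-side over whatever samples fall inside the range.

```python
from typing import Sequence


def uptime_fraction(samples: Sequence[int]) -> float:
    """Mean of 0/1 `up` samples, mirroring what avg_over_time(up[1h]) returns."""
    if not samples:
        raise ValueError("no samples in the window")
    return sum(samples) / len(samples)


# Hypothetical window: 8 scrapes, 2 of which failed.
window = [1, 1, 0, 1, 1, 1, 0, 1]
fraction = uptime_fraction(window)
print(f"uptime: {fraction:.2%}")
```

Note that the average only says how much of the window was down, not how many separate outages there were.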



--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsub...@googlegroups.com.
To post to this group, send email to prometheus-developers@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-developers/5d88134d-7099-404b-9b1b-2bbd338b3f04%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

awa...@gmail.com

Jul 6, 2017, 2:01:57 PM
to Prometheus Developers, awa...@gmail.com
Thanks Ben for your quick reply.

However, I am not sure I get the full picture of your suggestion. I need to count the number of times the up metric was == 0 for less than a minute, and then get those counts in hourly bins.


Ben Kochie

Jul 6, 2017, 2:23:51 PM
to awa...@gmail.com, Prometheus Developers
My idea is that instead of thinking about specific buckets, you can simplify things based on an SLO/SLA metric.

Say you have a 15s scrape interval, that's 240 samples per hour.

If one lost sample per hour was OK (239/240, about 99.58%), you could set an alert for an uptime average below 99.5%.

This is much easier to deal with than trying to line up buckets in the way you are trying to do.

However, what you're asking for is possible.
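
The arithmetic behind the 99.5% threshold can be checked with a few lines of Python. This is only a sketch of the calculation described above; the function names are invented for the example, and it assumes every scrape in the hour produces a sample.

```python
def samples_per_hour(scrape_interval_s: int) -> int:
    """Number of scrapes in one hour for a given scrape interval in seconds."""
    return 3600 // scrape_interval_s


def uptime_with_lost_samples(scrape_interval_s: int, lost: int) -> float:
    """Hourly average of `up` when `lost` scrapes failed and the rest succeeded."""
    total = samples_per_hour(scrape_interval_s)
    return (total - lost) / total


ALERT_THRESHOLD = 0.995  # the 99.5% threshold suggested above

for lost in (0, 1, 2):
    uptime = uptime_with_lost_samples(15, lost)
    fires = uptime < ALERT_THRESHOLD
    print(f"lost={lost} uptime={uptime:.4%} alert={fires}")
```

With a 15s interval, one failed scrape stays above the threshold and a second one drops below it.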


awa...@gmail.com

Jul 6, 2017, 2:37:14 PM
to Prometheus Developers, awa...@gmail.com
Ah. Now I understand your point. Yes, your approach is perfect for alerting and I will end up using it very shortly.

But unfortunately, I need to create a diagram to show my network administrator the number of micro-downtime events.



Brian Brazil

Jul 7, 2017, 6:42:33 AM
to awa...@gmail.com, Prometheus Developers
On 6 July 2017 at 19:37, <awa...@gmail.com> wrote:
> Ah. Now I understand your point. Yes, your approach is perfect for alerting and I will end up using it very shortly.
>
> But unfortunately, I need to create a diagram to show my network administrator the number of micro-downtime events.

This sort of reporting query is best done in a scripting language; you're pushing the limits of what's sane and understandable in PromQL.

Something like:

down_under_1m = up * (downsince != bool 0) * ((time() - downsince) < bool 60)
downsince =
    (up == 1) * 0
  or
    downsince != 0
  or
    up + time()

should do it, and sum_over_time will count the instances. This requires the rule groups feature in Prometheus 2.0 to work correctly.

Brian
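
For the scripting-language route, a minimal Python sketch of the same idea might look like the following. It assumes the raw up samples are already available as time-ordered (timestamp, value) pairs; the function name, the 60-second default, and the choice to bin each outage by the hour in which the target recovered are assumptions made for this example, not part of the suggestion above.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def short_outages_per_hour(
    samples: Iterable[Tuple[float, int]], max_duration_s: float = 60.0
) -> Dict[int, int]:
    """Count outages shorter than max_duration_s, keyed by hour start (Unix seconds)."""
    counts: Counter = Counter()
    down_since = None  # timestamp of the first failed scrape in the current outage
    for ts, up in samples:
        if up == 0:
            if down_since is None:
                down_since = ts
        elif down_since is not None:
            # Target is back: measure from the first failed scrape to recovery.
            if ts - down_since < max_duration_s:
                hour_start = int(ts // 3600) * 3600
                counts[hour_start] += 1
            down_since = None
    return dict(counts)


# Hypothetical samples at a 15s scrape interval.
samples = [
    (0, 1), (15, 0), (30, 1),                      # 15s outage, counted
    (45, 1), (60, 0), (75, 0), (90, 0),
    (105, 0), (120, 0), (135, 1),                  # 75s outage, too long
    (3600, 1), (3615, 0), (3630, 0), (3645, 1),    # 30s outage, next hour
]
print(short_outages_per_hour(samples))
```

An outage that is still in progress at the end of the data is not counted, because its duration is not yet known.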
 



awa...@gmail.com

Jul 12, 2017, 11:20:23 AM
to Prometheus Developers, awa...@gmail.com
Thank you, Brian, for your reply. I think I am truly pushing the limits of Prometheus queries. I have decided to drop this and go with a raw data export to be analysed by our BI team.
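
One possible shape for such an export, sketched in Python: build a request URL for the Prometheus HTTP API range-query endpoint (/api/v1/query_range) and flatten the JSON matrix it returns into CSV. The server address, time range, helper names, and sample payload are all assumptions for illustration; no request is actually sent here.

```python
import csv
import io
import json
from urllib.parse import urlencode


def query_range_url(base_url: str, query: str, start: int, end: int, step: int) -> str:
    """Build a /api/v1/query_range URL; start/end are Unix seconds, step is seconds."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def matrix_to_csv(payload: dict) -> str:
    """Flatten a range-query (matrix) response into instance,timestamp,value rows."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["instance", "timestamp", "value"])
    for series in payload["data"]["result"]:
        instance = series["metric"].get("instance", "")
        for ts, value in series["values"]:
            writer.writerow([instance, ts, value])
    return out.getvalue()


# Hypothetical server address and time range.
url = query_range_url("http://localhost:9090", "up", 1499356800, 1499360400, 15)

# Hand-written stand-in for the JSON body a range query would return.
sample_json = """
{"status": "success",
 "data": {"resultType": "matrix",
          "result": [{"metric": {"__name__": "up", "instance": "host-a:9100"},
                      "values": [[1499356800, "1"], [1499356815, "0"]]}]}}
"""
csv_text = matrix_to_csv(json.loads(sample_json))
print(url)
print(csv_text)
```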