Raise an alert before system goes down.


Isabel Noronha

Feb 20, 2020, 5:26:05 AM
to Prometheus Users
I have configured alerts that fire after the system goes down.
However, is there a way to get an alert in Prometheus before the instance goes down?

Stuart Clark

Feb 20, 2020, 5:33:18 AM
to Isabel Noronha, Prometheus Users
As long as there are some metrics which give a good indication that things are about to break, then yes.

For example, you can alert on 100% disk usage, but you can also alert on disk usage projected to reach 100% within the next hour.
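
A minimal sketch of that second kind of alert, assuming the node_exporter's node_filesystem_avail_bytes metric is being scraped (the rule name, "for:" duration, and warning severity are illustrative, not from the thread):

```yaml
- name: DiskProjection
  rules:
  # Fire if, extrapolating the last 10 minutes of growth linearly,
  # the filesystem would have no space left one hour (3600s) from now.
  - alert: DiskProjectedFull
    expr: predict_linear(node_filesystem_avail_bytes[10m], 3600) < 0
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem projected to fill within the next hour'
```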


Isabel Noronha

Feb 24, 2020, 12:04:00 AM
to Prometheus Users
Thank you!
What kind of PromQL queries can I use to send alerts that my system will go down in the future, so that I can prevent it?

Stuart Clark

Feb 24, 2020, 5:23:40 AM
to Isabel Noronha, Prometheus Users

It very much depends on your system and how you can catch errors before
they become too serious.

Examples would be to look at the linear prediction (predict_linear) and
derivative (deriv) functions. For example, alerting on a disk due to run
out of space within 4 hours, or a response-time SLA due to be breached
within 2 hours.
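
A sketch of that SLA case. The recording rule name job:http_request_duration_seconds:avg and the 0.5s SLA threshold are hypothetical, purely for illustration; substitute your own latency metric and limit:

```yaml
- name: LatencyProjection
  rules:
  # Warn if average request latency, extrapolated linearly from the
  # last hour of data, would exceed a 0.5s SLA two hours (7200s) from now.
  - alert: LatencySlaAtRisk
    expr: |
      predict_linear(job:http_request_duration_seconds:avg[1h], 7200) > 0.5
    for: 2h
    labels:
      severity: warning
    annotations:
      summary: 'Response time projected to breach SLA within 2h at current trend'
```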


--
Stuart Clark

Brian Candler

Feb 24, 2020, 6:51:46 AM
to Prometheus Users
This should be tempered with an expectation that there are many failures that you *can't* predict. The ones you may be able to predict are around running out of a resource - primarily disk space or RAM/swap. (See examples at the end.)

This "philosophy on alerting" document from an ex-Google site reliability engineer is well worth reading.

The most important takeaway is: focus your alerting effort on symptoms (i.e. problems seen by the end user) rather than causes. Alerts then become immediately actionable. Put another way: if the problem isn't bad enough for end users to notice, it's not worth getting somebody out of bed to deal with it. Also, try to design your systems so that they are resilient to the failure of a single node.

Sometimes analysis of these end-user measurements can give you predictions of future failures - e.g. if you find a particular server's response time is higher than normal, this may be a precursor to it failing completely.

Anyway, here are some rules you can use as a starting point. For the linear-prediction rules, note that the "for:" clause requires the condition to hold for twice the duration of the range window used in the prediction. This means that if there's a sudden step in disk usage which then goes back to flat, it doesn't send an alert.

groups:
- name: DiskSpace
  interval: 1m
  rules:
  # Alert if any filesystem has less than 100MB available space.
  - alert: DiskFull
    expr: |
      node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"} < 100000000 and node_filesystem_size_bytes{fstype!~"fuse.*|nfs.*"} > 100000000
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem full or less than 100MB available space'

- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if rate of growth over last 10 minutes means filesystem will fill in 2 hours
  - alert: DiskFilling
    expr: |
      predict_linear(node_filesystem_free_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 100000000
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in less than 2h at current 10m growth rate'

- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
  - alert: DiskFilling
    expr: |
      predict_linear(node_filesystem_free_bytes{fstype!~"fuse.*|nfs.*"}[3h], 2*86400) < 100000000
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in less than 2d at current 3h growth rate'
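
In the same spirit, the RAM/swap case mentioned above can be sketched with predict_linear over the node_exporter's node_memory_MemAvailable_bytes metric (the 100MB floor, window, and durations here are illustrative, chosen to match the disk rules above):

```yaml
- name: MemoryRate30m
  interval: 1m
  rules:
  # Warn if, at the growth rate of the last 30 minutes, available
  # memory would drop below 100MB within the next hour (3600s).
  - alert: MemoryFilling
    expr: |
      predict_linear(node_memory_MemAvailable_bytes[30m], 3600) < 100000000
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: 'Available memory projected below 100MB within 1h at current 30m trend'
```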

Isabel Noronha

Feb 25, 2020, 4:56:45 AM
to Prometheus Users
Thank you so much!

Isabel Noronha

Feb 25, 2020, 4:57:03 AM
to Prometheus Users
Thank you!