Giving different Dynamic Thresholds for the same alert.

Yagyansh S. Kumar

Mar 14, 2020, 11:32:15 AM
to Prometheus Users
Hi. In my prometheus.yml file, every target necessarily has two labels, namely "cluster" and "node".

I have configured an alert for CPU load with a dynamic threshold of "number of cores of the server".
Configured alert:
  - alert: HighCpuLoad
    expr: (node_load15 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) * on(instance) group_left(nodename) node_uname_info
    for: 5m
    labels:
      severity: "CRITICAL"
    annotations:
      summary: "CPU load on *{{ $labels.instance }}* - *{{ $labels.nodename }}* is more than the number of cores of the machine."
      description: "Current Value = *{{ $value | humanize }}*"
      identifier: "*Cluster:* `{{ $labels.cluster }}`, *node:* `{{ $labels.node }}` "

Now, for a particular node value (let it be A, which belongs to cluster X) I want this threshold to be 2 * NumberOfCores. How can I do this?

Thanks.

Christian Hoffmann

Mar 14, 2020, 11:56:39 AM
to Yagyansh S. Kumar, Prometheus Users
Hi,

On 3/14/20 4:32 PM, Yagyansh S. Kumar wrote:
> Hi. In my prometheus.yml file all the targets necessarily have 2 labels
> viz "cluster" and "node".
[...]
>
> Now, for a particular node value(Let it be A which belongs to X cluster)
> I want this threshold to be 2*NumberofCores. How can I do this?

You could use this pattern:
https://www.robustperception.io/using-time-series-as-alert-thresholds

With a slight variation -- you would not specify the actual threshold in
your "threshold metric". Instead, you would use it as a factor in your
query. If it's absent, you could default to 1.

Kind regards,
Christian

Yagyansh S. Kumar

Mar 14, 2020, 12:06:23 PM
to Prometheus Users
Can you explain in a little more detail, please?

Christian Hoffmann

Mar 14, 2020, 12:36:38 PM
to Yagyansh S. Kumar, Prometheus Users
On 3/14/20 5:06 PM, Yagyansh S. Kumar wrote:
> Can you explain in a little detail please?
I'll try to walk through your example in several steps:

## Step 1
Your initial expression was this:

(node_load15 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))
  * on(instance) group_left(nodename) node_uname_info


## Step 2
Let's drop the info part for now to make things simpler (you can add it
back at the end):

node_load15 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})


## Step 3
With that query, you could add a factor. The simplest way would be to have
two alerts: one for the machines with the 1x factor and one for those with
the 2x factor:

node_load15{instance=~"a|b|c"} > count without (cpu, mode)
(node_cpu_seconds_total{mode="system"})

and

node_load15{instance!~"a|b|c"} > count without (cpu, mode)
(node_cpu_seconds_total{mode="system"}) * 2
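
For illustration, the 2x variant could then be wired into an alert rule just
like your existing one (untested sketch; the rule name and the annotation are
made up):

  - alert: HighCpuLoadRelaxedThreshold
    expr: node_load15{instance!~"a|b|c"} > count without (cpu, mode) (node_cpu_seconds_total{mode="system"}) * 2
    for: 5m
    labels:
      severity: "CRITICAL"
    annotations:
      summary: "CPU load on *{{ $labels.instance }}* is more than twice the number of cores."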


## Step 4
Depending on your use case, this may already be enough. However, you would
need to modify those two alerts whenever you add a machine. So, something
more scalable would be to use a metric (e.g. from a recording rule) for the
scale factor:

node_load15 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
  * on(instance) group_left() cpu_core_scale_factor

This would require that you have a recording rule for each and every one of
your machines:

- record: cpu_core_scale_factor
  labels:
    instance: a
  expr: 1
- record: cpu_core_scale_factor
  labels:
    instance: c
  expr: 2 # factor two


## Step 5
A further simplification, maintenance-wise, would be if you could omit those
entries for your more common case (just the number of cores, no
multiplication factor).
This is what the linked blog post describes. Sadly, it complicates the
alert rule a little bit:


node_load15 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
  * on(instance) group_left() (
      cpu_core_scale_factor
    or on(instance)
      node_load15 * 0 + 1 # <-- the "1" is the default value
  )

The part after group_left() basically returns the value from your factor
recording rule. If it doesn't exist, it calculates a default value. This
works by taking an arbitrary metric which exists exactly once for each
instance. It makes sense to take the same metric which your alert is
based on. The value is multiplied by 0, as we do not care about the
value at all. We then add 1, the default value you wanted. Essentially,
this leads to a temporary, invisible metric. This part might be a bit
hard to get across, but basically you can just copy this pattern verbatim.

In this case, you would only need to add a recording rule for those
machines which should have a non-default (i.e. other than 1) CPU core
scale factor (like the "instance: c" rule above).
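
To close the loop with step 2: once this works, you can put the
node_uname_info join from your original rule back around it, roughly like
this (again untested):

(
  node_load15 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
    * on(instance) group_left() (
        cpu_core_scale_factor
      or on(instance)
        node_load15 * 0 + 1
    )
) * on(instance) group_left(nodename) node_uname_info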

## Step 6
As a last suggestion, you might want to revisit whether strict alerting on
the system load is useful at all. In our setup, we do alert on it, but only
on really high values which should only trigger if the load is skyrocketing
(usually due to some hanging network filesystem or another deadlock
situation).
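
For example (untested; the rule name is made up and the factor 4 and the 15m
duration are arbitrary, not a recommendation), such a "load is skyrocketing"
rule could look like:

  - alert: VeryHighCpuLoad
    expr: node_load15 > 4 * count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
    for: 15m
    labels:
      severity: "CRITICAL"
    annotations:
      summary: "Load on *{{ $labels.instance }}* has been above 4x the core count for 15 minutes."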


Note: All examples are untested, so take them with a grain of salt. I
just want to get the idea across.

Hope this helps,
Christian

Harald Koch

Mar 14, 2020, 2:06:37 PM
to Prometheus Users
While not directly related to CPU usage and the number of cores, I finally got around to publishing a draft of my article on this subject:

https://www.haraldkoch.ca/blog/index.php/2020/03/14/prometheus-alerting-rules-and-metadata/

I'm sure there are errors and omissions!

Feedback appreciated.

--
Harald Koch
c...@pobox.com

Yagyansh S. Kumar

Mar 14, 2020, 4:53:57 PM
to Prometheus Users
Awesome explanation. This helps a lot. Thanks, I appreciate it.

Yagyansh S. Kumar

Mar 14, 2020, 5:01:28 PM
to Prometheus Users
Also, since you mentioned hanging network filesystems: is there any way/logic to find out whether an NFS mount is hung on a machine or not? I have busted my ass trying to get this; I must have tried more than 50 things but still have nothing to show for it.
In our setup we use a lot of NFS, and some of the mounts are really critical. All these shared NFS mounts come from a 3rd-party vendor, and due to network lag, IP mismatch or 10 other reasons, the NFS ends up hung on a machine or two. I need to know whenever this happens. Anything that can be done here?

Christian Hoffmann

Mar 14, 2020, 5:19:01 PM
to Yagyansh S. Kumar, Prometheus Users
On 3/14/20 10:01 PM, Yagyansh S. Kumar wrote:
> Also, since you mentioned hanging network filesystem, is there any
> way/logic to find out whether my NFS mount is hanged on a machine or
> not? I have busted my ass on getting this result, must have tried more
> than 50 things but still have nothing in this matter.
> In our setup we use a lot of NFS and some of the mounts are really
> critical. All these shared NFS mounts are taken from a 3rd party vendor
> and due to network lag or IP mismatch or 10 other reasons, the NFS ends
> up being hanged on a machine or two. I need to know whenever this
> happens. Anything that can be done here?

I think I would aim for using the regular node_filesystem_device_error
metric nowadays, which is basically the statfs success status.
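
Untested, but an alert on it could be as simple as something like this (the
rule name is made up and the fstype regex may need adjusting to your setup;
the metric is 1 when the statfs call fails):

  - alert: NfsStatfsError
    expr: node_filesystem_device_error{fstype=~"nfs.*"} == 1
    for: 5m
    labels:
      severity: "CRITICAL"
    annotations:
      summary: "statfs on *{{ $labels.mountpoint }}* (*{{ $labels.instance }}*) is failing -- possibly a hung NFS mount."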

In earlier node_exporter versions, a hung NFS mount could easily prevent
node_exporter from working reliably, which is why we still have NFS
excluded via --collector.filesystem.ignored-fs-types. However, since
#997 [1] this should have improved. Therefore, I plan to give this
a go again.

Other than that, there are NFS client metrics, but I'm not sure if you
can derive a hung/not-hung result from them.

I was about to link to another thread from some weeks ago, but I just
noticed that it was started by you as well [2]. ;)

I think that Ben's suggestion is basically the same. Julien's approach
regarding separating collectors into different jobs (in the same mail
thread) also sounded interesting.
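
For reference, I understood that separate-jobs idea roughly as in the sketch
below (untested, and not necessarily exactly what Julien described):
node_exporter's collect[] URL parameter limits which collectors run for a
given scrape, so a hung NFS mount would then only stall the filesystem-only
job.

scrape_configs:
  - job_name: node            # everything except the filesystem collector
    params:
      collect[]:              # illustrative subset; list whatever collectors you need
        - cpu
        - loadavg
        - meminfo
    static_configs:
      - targets: ['nodeA:9100']   # placeholder target
  - job_name: node_filesystem # only the filesystem collector, so a hung NFS
    scrape_timeout: 10s       # mount only affects this job
    params:
      collect[]:
        - filesystem
    static_configs:
      - targets: ['nodeA:9100']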

Have you done some experiments with node_filesystem_device_error?

Kind regards,
Christian


[1] https://github.com/prometheus/node_exporter/pull/997
[2]
https://groups.google.com/d/msgid/prometheus-users/CABbyFmqMKQXYNOfdr7BeFA%3Dx%3D5fY%2Bk4EQ8oprL0Wh-8SNqmvoA%40mail.gmail.com?utm_medium=email&utm_source=footer

Yagyansh S. Kumar

Mar 14, 2020, 5:35:25 PM
to Prometheus Users
Yes, I did experiment with node_filesystem_device_error earlier based on Ben's suggestion in my earlier thread, but not extensively. Also, I didn't know it reflects the statfs success status. From what I have read so far on this matter, statfs is the best way to find out whether your filesystem is hanging or not. Hence, I'll definitely give node_filesystem_device_error another try and see if I can come up with something interesting.

Thanks a lot for your help. Cheers!

Christian Hoffmann

Mar 14, 2020, 6:00:59 PM
to Yagyansh S. Kumar, Prometheus Users
Hi,

On 3/14/20 10:35 PM, Yagyansh S. Kumar wrote:
> Yes, I did experiment with node_filesystem_device_error earlier based on
> Ben's suggestion on my earlier thread, but not extensively. Also, I
> didn't know it is Statfs success. With what I have read so far on this
> matter, statfs is the best way to find your filesystem is hanging or
> not. Hence, I'll definitely give node_filesystem_device_error another
> try and see if I can come up with something interesting.
Yeah, this should be it:
https://github.com/prometheus/node_exporter/blob/master/collector/filesystem_linux.go#L78

Please report back with your results -- I'm also highly interested. :)

Kind regards,
Christian

Yagyansh S. Kumar

Mar 14, 2020, 6:26:19 PM
to Prometheus Users
Sure. Will absolutely do.

Yagyansh S. Kumar

Apr 18, 2020, 12:32:02 PM
to Prometheus Users
These are my observations from one week of using node_filesystem_device_error for my production NFS mounts:

node_filesystem_device_error works quite well for hanging NFS. Whenever an NFS mount is in a hung state, node_filesystem_device_error will definitely indicate that there is an issue.
But the converse does not hold, i.e. if node_filesystem_device_error == 1, it does not necessarily mean the NFS mount is hung; there might be some other reason the statfs call is failing. One reason I recently noticed is "Stale file handle".

So, all in all, if your NFS mount is in a hung state, node_filesystem_device_error should definitely inform you about it.

Shruthi P

Oct 23, 2020, 12:44:40 PM
to Prometheus Users
For POSIX, can you please tell me in which scenarios node_filesystem_device_error becomes 1 without the mount going into a hung state?