Giving dynamic thresholds in Alertmanager.

Yagyansh S. Kumar

Mar 11, 2020, 1:00:57 PM
to Prometheus Users
Hi. I have configured an alert for CPU load on my servers; my current thresholds are 8 for warning and 10 for critical.
I want to make this threshold dynamic, i.e. I want the critical alert to fire when the CPU load exceeds the number of CPU cores on the machine.
E.g. for a server with 8 CPU cores, I want a critical alert when the CPU load is > 8, and for a machine with 16 CPU cores, when the CPU load is > 16.

Thanks in advance!

Harald Koch

Mar 11, 2020, 1:17:14 PM
to Prometheus Users
    count without (cpu, mode) (node_cpu_seconds_total{mode="system"})

is one way of counting the number of CPUs on a server, so:

    (node_load5 / count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) > 1

would alert you when the (5 minute) load average is greater than the number of CPUs.

I find this alert to be noisy - your experience may differ.

--
Harald
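
One common way to make an alert like this less noisy is a `for:` clause, so the condition has to hold for some time before the alert fires. A minimal sketch of a rule file built around the ratio expression above; the group name, alert name, the 15m duration, and the severity label are illustrative assumptions, not values from this thread:

```yaml
groups:
  - name: cpu-load                  # hypothetical group name
    rules:
      - alert: HighLoadPerCore      # hypothetical alert name
        # 5-minute load average divided by the CPU count of the same instance
        expr: (node_load5 / count without (cpu, mode) (node_cpu_seconds_total{mode="system"})) > 1
        for: 15m                    # assumed duration; the condition must hold this long before firing
        labels:
          severity: critical
```

A longer `for:` trades slower notification for fewer alerts on short load spikes; the right value depends on the workload.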


Yagyansh S. Kumar

Mar 11, 2020, 1:37:21 PM
to Prometheus Users
Maybe I'll refine the threshold even further, but for now this works. Thanks a lot for the help.

Yagyansh S. Kumar

Mar 11, 2020, 1:49:38 PM
to Prometheus Users
I have one more small question.
If I use this expression for alerting, the value I get from $value in my alert will be the load per CPU core.
How can I get the actual CPU load value itself while using the expression you mentioned?

Julien Pivotto

Mar 11, 2020, 1:51:35 PM
to Yagyansh S. Kumar, Prometheus Users
Hi,

you can use

node_load5 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"}))

regards,

--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu

Julien Pivotto

Mar 11, 2020, 1:53:08 PM
to Yagyansh S. Kumar, Prometheus Users
Without the extra closing parenthesis:

node_load5 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})


Yagyansh S. Kumar

Mar 11, 2020, 1:53:09 PM
to Prometheus Users
Thanks for the response, Julien. I have already tried the query you mentioned, but it doesn't work.

Yagyansh S. Kumar

Mar 11, 2020, 1:59:42 PM
to Prometheus Users
I mean it doesn't work for giving me the actual load value.
The expression you mentioned is perfect for defining the threshold, but the value it returns will be (actual load value - number of CPU cores).
How do I still get the actual load value printed in the alert, with the threshold still being the number of cores?

Julien Pivotto

Mar 11, 2020, 2:01:49 PM
to Yagyansh S. Kumar, Prometheus Users
The expression I gave you will do that.
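
The reason is that a PromQL comparison between two instant vectors, written without the `bool` modifier, acts as a filter: series on the left-hand side that satisfy the comparison are kept with their original sample value. So with `node_load5` on the left, `$value` is the load itself rather than a difference or a ratio. A minimal sketch of such a rule; the group name, alert name, and annotation wording are assumptions for illustration:

```yaml
groups:
  - name: cpu-load                    # hypothetical group name
    rules:
      - alert: LoadAboveCoreCount     # hypothetical alert name
        # Keeps node_load5 samples (with their own values) where load > CPU count
        expr: node_load5 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
        labels:
          severity: critical
        annotations:
          # $value is the left-hand side of the comparison, i.e. the 5-minute load average
          summary: "CPU load on {{ $labels.instance }} is {{ $value }}"
```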

Yagyansh S. Kumar

Mar 11, 2020, 2:03:47 PM
to Prometheus Users
Oh sorry, my bad! Yes, it does. I was comparing the wrong things.
Thanks a lot!

Yagyansh S. Kumar

Mar 11, 2020, 2:18:51 PM
to Prometheus Users
I'm sorry, but one last question :P.
Is there any way I can also get the number of cores in the alert?

E.g. in the summary I want to print: The CPU load (value = $value) is more than the number of cores (can I get this value here?).
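
One approach that may work here is the `query` function available in Prometheus alert templates, which runs a second PromQL query when the annotation is rendered. The sketch below assumes that both metrics carry an `instance` label identifying the machine; the group name, alert name, and summary wording are hypothetical:

```yaml
groups:
  - name: cpu-load                    # hypothetical group name
    rules:
      - alert: LoadAboveCoreCount     # hypothetical alert name
        expr: node_load5 > count without (cpu, mode) (node_cpu_seconds_total{mode="system"})
        annotations:
          # The template re-runs the core-count query for this alert's instance
          # and prints the value of the first returned sample.
          summary: >-
            CPU load ({{ $value }}) on {{ $labels.instance }} is more than the number of cores
            ({{ with printf "count without (cpu, mode) (node_cpu_seconds_total{mode='system', instance='%s'})" $labels.instance | query }}{{ . | first | value }}{{ end }}).
```

If the inner query returns no series, the `with` block renders nothing, so the parentheses in the summary would be empty rather than causing a template error.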