Re: [prometheus-users] Custom Threshold for a particular instance.


Christian Hoffmann

Jun 30, 2020, 2:14:01 AM
to yagyans...@gmail.com, Prometheus Users
Hi,

On 6/24/20 8:09 PM, yagyans...@gmail.com wrote:
> Hi. Currently I am using a custom threshold in case of my Memory alerts.
> I have 2 main labels for my every node exporter target - cluster and
> component.
> So far my custom thresholds have been based on the component, since I
> had to define a given threshold for all the servers of that component.
> But now I have 5 instances, all from different components, and I have
> to set their threshold to 97. How do I approach this?
>
> My typical node exporter job.
>   - job_name: 'node_exporter_JOB-A'
>     static_configs:
>     - targets: [ 'x.x.x.x:9100' , 'x.x.x.x:9100']
>       labels:
>         cluster: 'Cluster-A'
>         env: 'PROD'
>         component: 'Comp-A'
>     scrape_interval: 10s
>
> Recording rule for custom thresholds.
>   - record: abcd_critical
>     expr: 99.9
>     labels:
>       component: 'Comp-A'
>
>   - record: xyz_critical
>     expr: 95
>     labels:
>       node: 'Comp-B'
>
> The expression for Memory Alert.
> ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes -
> node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100) *
> on(instance) group_left(nodename) node_uname_info > on(component)
> group_left() (abcd_critical or xyz_critical or on(node) count by
> (component)((node_memory_MemTotal_bytes - node_memory_MemFree_bytes -
> node_memory_Cached_bytes) / node_memory_MemTotal_bytes * 100) * 0 + 90)
>
> Now, I have 5 servers with different components. How to include that in
> the most optimized manner?

This looks almost like the pattern described here:
https://www.robustperception.io/using-time-series-as-alert-thresholds

It looks like you already tried to integrate the two different ways of
specifying thresholds, right? Is there a specific problem with it?

Sadly, this pattern quickly becomes complex, especially if nested (like
you would need to do) and if combined with an already longer query (like
in your case).

I can only suggest trying to move some of the complexity out of the
query (e.g. by moving the memory calculation into a recording rule instead).
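For example, a minimal sketch (the rule name instance:memory_usage:percent is just an illustration, not something from your config):

```yaml
groups:
  - name: memory-usage
    rules:
      # Pre-compute the memory usage percentage per instance, so the
      # alert expression only has to compare against a threshold.
      - record: instance:memory_usage:percent
        expr: >
          (node_memory_MemTotal_bytes - node_memory_MemFree_bytes
            - node_memory_Cached_bytes)
          / node_memory_MemTotal_bytes * 100
```

The alert expression then shrinks to a comparison of this series against your threshold series.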

You can also split the rule into multiple rules (with the same name).
You will just have to ensure that they only ever fire for a subset of
your instances (e.g. the first variant would only fire for
component-based thresholds, the second only for instance-based
thresholds).
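A sketch of what that could look like, assuming recording rules named instance:memory_usage:percent, component:memory_threshold:percent and instance:memory_threshold:percent (all names illustrative, not from your config):

```yaml
groups:
  - name: memory-alerts
    rules:
      # Variant 1: component-based thresholds. The `unless` clause keeps
      # this rule from firing for instances that have their own threshold.
      - alert: HighMemoryUsage
        expr: >
          (instance:memory_usage:percent
            unless on(instance) instance:memory_threshold:percent)
          > on(component) group_left() component:memory_threshold:percent
      # Variant 2: instance-based thresholds only.
      - alert: HighMemoryUsage
        expr: >
          instance:memory_usage:percent
          > on(instance) group_left() instance:memory_threshold:percent
```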

Hope this helps.

Kind regards,
Christian

Yagyansh S. Kumar

Jul 2, 2020, 2:32:43 PM
to Christian Hoffmann, Prometheus Users
Hi Christian,

Actually, I wanted to ask if there is any better way to define the threshold for my 5 new servers that belong to 5 different components. Is writing 5 different recording rules with the same name, but different instance and component labels, the only way to proceed here? Won't that be a little too dirty to maintain? What if it were 20 servers, all belonging to different components?

Yagyansh S. Kumar

Jul 2, 2020, 2:38:08 PM
to Christian Hoffmann, Prometheus Users
Also, so far I have only tried a single way of giving a custom threshold, i.e. based on the component name. For example, all the targets under Comp-A have a threshold of 99.9 and all the targets under Comp-B have a threshold of 95.
But now I have to give a common custom threshold, say 98, to 5 different targets, each belonging to a different component. All 5 components have more than one target, but I want the custom threshold applied to only a single target from each component.

sayf eddine Hammemi

Jul 3, 2020, 1:26:01 AM
to Yagyansh S. Kumar, Christian Hoffmann, Prometheus Users
The other, more proper way is to dynamically generate the alerts, hardcoding the thresholds based on labels.
For example, use a combination of YAML and Jinja to store the thresholds in a maintainable format and have one command regenerate everything.
Every time you want to change a value, you just regenerate the alerts.
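A minimal sketch of that idea (Python string formatting standing in for Jinja to keep it dependency-free; the rule name, label names and instances are illustrative):

```python
# Generate Prometheus alerting rules from a simple threshold map.
# Stand-in for the yaml/jinja combo described above; rule and label
# names are illustrative, not from the original config.

THRESHOLDS = {
    # instance         -> threshold (%)
    "10.0.0.1:9100": 98,
    "10.0.0.2:9100": 98,
}

RULE_TEMPLATE = """\
  - alert: HighMemoryUsage
    expr: instance:memory_usage:percent{{instance="{instance}"}} > {threshold}
    labels:
      severity: critical
"""

def render_rules(thresholds):
    """Render a complete rule file from the threshold map."""
    header = "groups:\n- name: memory-alerts\n  rules:\n"
    body = "".join(
        RULE_TEMPLATE.format(instance=i, threshold=t)
        for i, t in sorted(thresholds.items())
    )
    return header + body

if __name__ == "__main__":
    print(render_rules(THRESHOLDS))
```

Adding or changing a threshold is then a one-line edit to the map followed by rerunning the generator.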


yagyans...@gmail.com

Jul 3, 2020, 3:12:49 AM
to Prometheus Users
This seems like an interesting approach. If possible, could you please give some more insight into it?