Defining Prometheus alerts with different thresholds per node

LabTest Diagnostics

Jul 1, 2020, 8:17:12 PM
to Prometheus Users
I've written some alerts for memory usage (for windows nodes) that look like this:

expr: 100 * (windows_os_physical_memory_free_bytes) / (windows_cs_physical_memory_bytes) < 70

Currently, any server that exceeds 70% memory usage should trigger an alert. This doesn't work for me, as some nodes consistently run above 80% memory usage.

Is there a way to specify the threshold levels for alerts on a per-instance basis?

Christian Hoffmann

Jul 2, 2020, 8:32:30 AM
to LabTest Diagnostics, Prometheus Users
Hi,

On 7/2/20 2:17 AM, LabTest Diagnostics wrote:
> I've written some alerts for memory usage (for windows nodes) that look
> like this:
>
> expr: 100 * (windows_os_physical_memory_free_bytes) / (windows_cs_physical_memory_bytes) < 70
>
> Currently, any server that exceeds 70% memory usage should trigger an
> alert. This doesn't work for me, as some nodes consistently run above
> 80% memory usage.
>
> Is there a way to specify the threshold levels for alerts on a
> per-instance basis?

Yes, you can use time series as thresholds:
https://www.robustperception.io/using-time-series-as-alert-thresholds
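
A minimal sketch of that pattern applied to the memory alert above, not a tested configuration. It assumes the goal is to alert on used memory (so the free-memory ratio from the original expression is inverted), that Server1 and Server2 are the exact values of the instance label (in practice the label often includes a port), and that 90 and 70 are the intended thresholds. The recording rule name instance:memory_used_percent:threshold is hypothetical.

```yaml
groups:
- name: MemoryAlert
  rules:
  # One recording rule per instance that needs a non-default threshold.
  # Rule name and instance values are assumptions for illustration.
  - record: instance:memory_used_percent:threshold
    expr: 90
    labels:
      instance: Server1
  - record: instance:memory_used_percent:threshold
    expr: 90
    labels:
      instance: Server2

  - alert: MemoryUsageTooHigh
    expr: |
      # Used memory as a percentage of physical memory, per instance.
        100 * (1 - windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes)
      > on (instance) group_left()
        (
            # Per-instance override, where one is defined ...
            instance:memory_used_percent:threshold
          or on (instance)
            # ... otherwise a default of 70 for every instance that reports memory.
            count by (instance) (windows_cs_physical_memory_bytes) * 0 + 70
        )
```

The `count by (instance) (...) * 0 + 70` term produces one series with the value 70 for each instance, and `or on (instance)` uses it only for instances that have no override series.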

Kind regards
Christian

LabTest Diagnostics

Jul 2, 2020, 1:00:48 PM
to Prometheus Users
Hello Christian,
Does this look right for my use case? (I need more clarity on what the "something" in the alert block is.)


groups:
- name: MemoryAlert
  rules:
  - record: Memory_Usage_Too_High
    expr: 100*(windows_os_physical_memory_free_bytes)/(windows_cs_physical_memory_bytes)<90
    labels:
      instance: Server1, Server2
  - alert: MemoryUsageTooHigh
    expr: |
      # Alert based on per-team thresholds.
        something #what is this?
      > on (instance) group_left
        (
            Memory_Usage_Too_High
          or on (instance)
            count by (instance)(something) * 0 + 70 #For all other instances/server memory usage shouldn't exceed 70%
        )


Thank you!