Issue with resolved alerts not sending notifications


mohammad md

Oct 23, 2024, 11:26:30 AM
to Prometheus Users

I am running Prometheus to monitor system resources like memory and CPU usage, as well as other services on the infrastructure. I rely on Alertmanager to send alerts to Telegram whenever a specific issue occurs (such as high memory usage or a service stopping).

The problem I'm facing is that Alertmanager is not sending a notification when an issue is resolved. My alert rules cover:

  • High CPU Usage: if CPU usage exceeds 70%.
  • High Memory Usage: if memory usage exceeds 85%.
  • Service Stopped: if a service stops working.

Alerts are sent to Alertmanager, which then sends notifications via Telegram when an issue arises.

The initial alert messages are received correctly when the problem occurs. However, when the system returns to a normal state and the issue is "resolved," Alertmanager does not send a notification indicating that the problem has been resolved.

Instead of a "Resolved" message when the issue is fixed, I see the same alert message (the one for the original issue) simply repeated.

Current Configuration:
Prometheus Configuration (file alerts.yml):
groups:
  - name: CPU Usage Alert
    rules:
      - alert: HighCPUUsage
        expr: ceil(100 * (1 - (avg by (Host, Client) (rate(node_cpu_seconds_total{mode="idle"}[5m]))))) > 70
        for: 6m
        labels:
          severity: Critical
          Host: "{{ $labels.Host }}"
          Client: "{{ $labels.Client }}"
        annotations:
          summary: "High CPU usage on {{ $labels.Host }} for {{ $labels.Client }} ({{ $value }})"
          description: "CPU usage on {{ $labels.Host }} for {{ $labels.Client }} has exceeded 70% for 5 minutes."
          resolved: "CPU usage on {{ $labels.Host }} for {{ $labels.Client }} is back to normal ({{ $value }})."
  - name: Memory Usage Alert
    rules:
      - alert: HighMemory
        expr: floor(1 - (avg(node_memory_MemAvailable_bytes) by (Client, Host) / avg(node_memory_MemTotal_bytes) by (Client, Host))) * 100 > 85
        for: 6m
        labels:
          severity: Critical
          Host: "{{ $labels.Host }}"
          Client: "{{ $labels.Client }}"
        annotations:
          summary: "High Memory usage on {{ $labels.Host }} for {{ $labels.Client }} ({{ $value }})"
          description: "Memory usage on {{ $labels.Host }} for {{ $labels.Client }} has exceeded 85% for 5 minutes."
          resolved: "Memory usage on {{ $labels.Host }} for {{ $labels.Client }} is back to normal ({{ $value }}%)."

Alertmanager Configuration (file alertmanager.yml):
global:
  resolve_timeout: 5m
route:
  receiver: telegram_receiver
  group_by: ["alertname", "Host"]
  group_wait: 15s
  group_interval: 15s
  repeat_interval: 24h
  routes:
    - receiver: 'telegram_receiver'
      matchers:
        - severity="Critical"
receivers:
  - name: 'telegram_receiver'
    telegram_configs:
      - api_url: 'https://api.telegram.org'
        send_resolved: true
        bot_token: xxxxxxxxxxxxxx
        chat_id: yyyyyyyyyyyyyyyyyyyyyyyyy
        message: '{{ range .Alerts }}Alert⚠️: {{ printf "%s\n" .Labels.alertname }}{{ printf "%s\n" .Annotations.summary }}{{ printf "%s\n" .Annotations.description }}{{ end }}'
        parse_mode: 'HTML'

I would greatly appreciate any guidance or solutions to this issue.

Brian Candler

Oct 24, 2024, 9:34:45 AM
to Prometheus Users
On Wednesday 23 October 2024 at 16:26:30 UTC+1 mohammad md wrote:
    annotations:
      summary: "High CPU usage on {{ $labels.Host }} for {{ $labels.Client }} ({{ $value }})"
      description: "CPU usage on {{ $labels.Host }} for {{ $labels.Client }} has exceeded 70% for 5 minutes."
      resolved: "CPU usage on {{ $labels.Host }} for {{ $labels.Client }} is back to normal ({{ $value }})."
I'm afraid that's not how alerting works in Prometheus.

In Prometheus, you write a PromQL expression which returns an instant vector, containing zero or more values that generate alerts. For example:

expr: foo > 0.9

The PromQL expression "foo" returns a vector of all metrics with name "foo". The expression "foo > 0.9" returns the same vector, but filtered down to only those timeseries whose value is > 0.9.

If the vector is empty, there's no alert. If the vector is non-empty, then one or more alerts are active.

So, suppose you have:

foo{instance="a"} 0.8
foo{instance="b"} 0.93
foo{instance="c"} 0.7
foo{instance="d"} 0.99

then the expression foo > 0.9 will return

[
foo{instance="b"} 0.93
foo{instance="d"} 0.99
]

and $value will be 0.93 or 0.99 respectively.

Now, let's say foo{instance="b"} drops to (say) 0.85, then the expression value becomes

[
foo{instance="d"} 0.99
]

and when foo{instance="d"} becomes 0.6 then you'll get an empty vector

[
]

And at this point, the alert is resolved. But there's no concept of "a normal value" for $value, because there is no $value at all, because the vector is empty.  All you get is the absence of an alert.  This means the "resolve" message, if it references $value at all, will show the last value which generated an alert.
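The filter-and-resolve behaviour can be sketched in a few lines of Python (purely illustrative — not Prometheus code, just the semantics of an instant-vector filter):

```python
# Toy model of an instant vector: label set -> sample value.
def alert_filter(vector, threshold=0.9):
    """Mimic 'foo > 0.9': keep only the series whose value exceeds the threshold."""
    return {labels: value for labels, value in vector.items() if value > threshold}

foo = {
    'instance="a"': 0.8,
    'instance="b"': 0.93,
    'instance="c"': 0.7,
    'instance="d"': 0.99,
}

firing = alert_filter(foo)      # series b and d pass the filter, so two alerts fire
print(firing)

foo['instance="b"'] = 0.85      # b recovers...
foo['instance="d"'] = 0.6       # ...and so does d
print(alert_filter(foo))        # empty dict: no alerts, and nothing to put in $value
```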

That's simply how it works. "foo > 0.9" is not a boolean test; it's a filter, and $value will only show values which pass through the filter. All other values are dropped.
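A practical consequence for the Telegram receiver above: Alertmanager renders the same `message` template for firing and resolved notifications, which is why the original alert text gets repeated. The notification template data does let you branch on status via the standard `.Alerts.Firing` and `.Alerts.Resolved` fields. A minimal sketch (not a tested config, just the idea), reusing the annotations from your rules:

```yaml
telegram_configs:
  - send_resolved: true
    message: >-
      {{ range .Alerts.Firing }}Alert⚠️: {{ .Labels.alertname }}
      {{ .Annotations.summary }}
      {{ end }}{{ range .Alerts.Resolved }}Resolved: {{ .Labels.alertname }}
      {{ .Annotations.resolved }}
      {{ end }}
```

Note that `.Annotations.resolved` here is just your custom annotation, not anything Prometheus treats specially — and any `{{ $value }}` inside it was expanded when the alert last fired, so it will show the last breaching value, not a "normal" one, for exactly the reason above.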