I am running Prometheus to monitor system resources such as memory and CPU usage, as well as other services on my infrastructure. Alertmanager sends alerts to Telegram whenever a specific issue occurs (such as high memory usage or a service stopping).
The problem I'm facing is that Alertmanager is not sending a notification when an issue is resolved.
My alert rules are:

High CPU Usage: fires when CPU usage exceeds 70%.
High Memory Usage: fires when memory usage exceeds 85%.
Service Stopped: fires when a monitored service stops working.
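For reference, the CPU and memory rules follow roughly this shape (a sketch assuming node_exporter metrics and a custom `Host` label; the expression here is illustrative, not copied verbatim from my config):

```yaml
- alert: HighCpuUsage
  # Percentage of non-idle CPU time over the last 5 minutes, per host.
  expr: 100 - (avg by (Host) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 70
  for: 5m
  labels:
    severity: warning
```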
Alerts are sent to Alertmanager, which then sends notifications via Telegram when an issue arises.
The initial alert message is received correctly when a problem occurs. However, when the system returns to a normal state and the alert is resolved, Alertmanager does not send a "Resolved" notification. Instead, the same firing alert message is simply repeated.
Current Configuration:
Prometheus Configuration (file alerts.yml):
groups:
  - name: CPU Usage Alert
    rules:
  - name: Memory Usage Alert
    rules:
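The "Service Stopped" rule is of the usual `up == 0` form (a sketch; the job name is a placeholder, not my actual job label):

```yaml
- alert: ServiceStopped
  # "up" is 0 when Prometheus fails to scrape the target.
  expr: up{job="my_service"} == 0
  for: 1m
  labels:
    severity: critical
```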
Alertmanager Configuration (file alertmanager.yml):
global:
  resolve_timeout: 5m
route:
  receiver: telegram_receiver
  group_by: ["alertname", "Host"]
  group_wait: 15s
  group_interval: 15s
  repeat_interval: 24h
  routes:
receivers:
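For completeness, my receiver is aiming for something roughly like this (a sketch assuming Alertmanager's built-in `telegram_configs`; the token and chat ID are placeholders, and I am not sure whether `send_resolved` and the `{{ .Status }}` template branch are what is required here, which is part of my question):

```yaml
receivers:
  - name: telegram_receiver
    telegram_configs:
      - bot_token: "<TELEGRAM_BOT_TOKEN>"   # placeholder
        chat_id: -1001234567890             # placeholder
        send_resolved: true
        parse_mode: "HTML"
        message: |
          {{ if eq .Status "firing" }}FIRING{{ else }}RESOLVED{{ end }}
          {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}
```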
The annotations on the CPU rule are:

      annotations:
        summary: "High CPU usage on {{ $labels.Host }} for {{ $labels.Client }} ({{ $value }})"
        description: "CPU usage on {{ $labels.Host }} for {{ $labels.Client }} has exceeded 70% for 5 minutes."
        resolved: "CPU usage on {{ $labels.Host }} for {{ $labels.Client }} is back to normal ({{ $value }})."

I would greatly appreciate any guidance or solutions to this issue.