alertmanager - Resolved message issue


sangjae lee

Oct 12, 2020, 3:58:03 AM10/12/20
to Prometheus Users
When a problem occurs, Prometheus fires an alert and Alertmanager sends a notification. The underlying problem is not cleared, yet a resolved message is always sent after 5 minutes.

The firing problem still exists, so why does Alertmanager send a resolved message?

Is this a bug?

1. I checked the firing alert's EndsAt time (using amtool), and it is updated every minute.
2. I checked that Prometheus keeps re-sending the firing alert.

Brian Candler

Oct 12, 2020, 5:42:10 AM10/12/20
to Prometheus Users
Show your alerting rule.

One possibility is that the labels of the alert are changing.  Alerts with different labels are treated as different alerts, and therefore the alert with the original set of labels will be considered resolved.
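For example (a hypothetical rule fragment, not something from your config): if you template the alert value into a label, the label set changes on every evaluation, so each change starts a new alert and the previous one is resolved:

  - alert: SomethingDown
    expr: some_metric > 5
    labels:
      severity: critical
      value: "{{ $value }}"  # templated label changes each evaluation => new alert, old one resolved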

> Is this a bug?

What versions of prometheus and alertmanager are you running?
Are you running any sort of HA setup for alertmanager?

sangjae lee

Oct 12, 2020, 8:00:06 PM10/12/20
to Prometheus Users
> What versions of prometheus and alertmanager are you running?
prometheus version: 2.20.1
alertmanager version: 0.21.0

> Are you running any sort of HA setup for alertmanager?
no

Here is my config
1. rules.yml
groups:
- name: dockermonitoring
  rules:
  # Alert for any docker container that is unreachable for >5 seconds.
  - alert: ContainerKilled
    expr: time() - container_last_seen{id=~"/docker/.*"} > 5
    for: 5s
    labels:
      severity: critical
    annotations:
      summary: "Container killed (instance {{ $labels.instance }})"
      description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"


2. alertmanager.yml
global:
  resolve_timeout: 1h

route:
  receiver: 'prometheus-msteams'
  group_by: ['name']
  group_wait: 1s
  group_interval: 30s
  repeat_interval: 1h

receivers:
  - name: 'prometheus-msteams'
    webhook_configs:
    - url: 'http://promteams:2000/alertmanager' # the prometheus-msteams proxy
      send_resolved: true


3. prometheus.yml
# my global config
global:
  scrape_interval:     10s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  query_log_file: /opt/opencafe/monitoring/prometheus/logs/query.log
  # scrape_timeout is set to the global default (10s).
.....

On Monday, October 12, 2020 at 6:42:10 PM UTC+9, b.ca...@pobox.com wrote:

Brian Candler

Oct 13, 2020, 3:28:27 AM10/13/20
to Prometheus Users
Thank you.  You originally said:

> The underlying problem is not cleared, yet a resolved message is always sent after 5 minutes.

It sounds to me like this is a staleness issue.  That is: the container_last_seen{...} metric which triggered the alert is no longer present in scrapes.  The PromQL rule evaluation only looks back 5 minutes in time to find a data point.  Anything older than that is not found.

When you have a PromQL expression like this:

    expr: foo > 5

it's really a chained filter:
(1) "foo" filters down to just metrics with __name__="foo"
(2) "> 5" further filters down to just metrics where the current value is > 5

The alert then fires if the filter returns one or more timeseries; and if a particular timeseries triggered an alert, but subsequently vanishes, then it is considered to be resolved.

If a particular timeseries hasn't been seen in a scrape for more than 5 minutes, then it won't be returned in step (1).

That's my best guess at what's going on.  To prove or disprove this, go into the PromQL browser in the web interface and enter

container_last_seen{id=~"/docker/.*"}[10m]

This will show you the raw datapoints (values and timestamps) over the last 10 minutes for that metric.  If a given timeseries stopped being scraped, then you'll see no more data points added.  So the last value scraped will be able to trigger an alert, but only for 5 minutes, until it becomes stale.
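You can also pull the same raw data from the HTTP API if that's easier; for instance (assuming Prometheus is listening on localhost:9090, adjust for your setup):

    # returns the raw samples (a range vector / matrix) for the last 10 minutes
    curl -s 'http://localhost:9090/api/v1/query' \
      --data-urlencode 'query=container_last_seen{id=~"/docker/.*"}[10m]'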

sangjae lee

Oct 15, 2020, 9:17:03 PM10/15/20
to Prometheus Users
Thanks for your reply.

I tested yesterday using Grafana Explore, with the following result:
- If a running Docker container is stopped or killed, the 'container_last_seen' data is only valid for 5 minutes.
> I guess that after 5 minutes the stopped or killed container's data is gone, which is why the resolved message is always sent after 5 minutes.
> I researched how to extend these 5 minutes and tried many config changes and tests, but it seems impossible.

So I tested the PromQL expression 'count(rate(container_last_seen{id=~"/docker/.*"}[1m])) < 10'.
This expression works exactly as intended.
But it only returns a count; it cannot show the instance name or docker id, so I cannot tell which Docker instance is actually down.

I really want to solve this issue:
when a Docker instance goes down, the alert should fire and be caught immediately, and after the Docker instance restarts, the resolved message should arrive correctly.


On Tuesday, October 13, 2020 at 4:28:27 PM UTC+9, b.ca...@pobox.com wrote:

Brian Candler

Oct 16, 2020, 3:25:54 AM10/16/20
to Prometheus Users
On Friday, 16 October 2020 02:17:03 UTC+1, sangjae lee wrote:
> I really want to solve this issue:
> when a Docker instance goes down, the alert should fire and be caught immediately, and after the Docker instance restarts, the resolved message should arrive correctly.


If you *know* which Docker instances should be there, then you can write a query using absent(). With "or" you can write an expression that alerts on a threshold *or* when the value is completely missing.
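As a rough sketch only (the name="myapp" matcher is a placeholder; use whatever label identifies your container, and this isn't a tested rule):

  - alert: ContainerKilledOrMissing
    expr: >
      (time() - container_last_seen{name="myapp"} > 5)
      or
      absent(container_last_seen{name="myapp"})
    for: 5s
    labels:
      severity: critical

The left-hand side covers the normal threshold case, and absent() fires when the timeseries has disappeared entirely (e.g. after it has gone stale).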
