Understanding prometheus talking to the alert manager

Jesús Roncero

Feb 10, 2017, 12:43:04 PM
to Prometheus Users
Hi,

I'm testing a Prometheus installation connected to Consul for service
discovery, and I'm running into a situation that has me confused and
that I'd like to understand.

I have defined a very simple alert that checks that the node_exporter is
running on the hosts that I get from consul.

The alert is:

ALERT InstanceDownTest
  IF up{job="worker"} == 0
  FOR 1m
  LABELS {
    severity = "critical",
  }
  ANNOTATIONS {
    summary = "Test: Instance {{$labels.consul_service_address}} is down",
    description = "Test: {{$labels.consul_service_address}} of job {{$labels.job}} has been down for more than 1 minute.",
  }
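(For readers on Prometheus 2.x, the same rule would look roughly like this in the newer YAML rule-file format — a sketch; the group name is arbitrary and not from the original post:

```yaml
groups:
  - name: example  # arbitrary group name
    rules:
      - alert: InstanceDownTest
        expr: up{job="worker"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Test: Instance {{ $labels.consul_service_address }} is down"
          description: "Test: {{ $labels.consul_service_address }} of job {{ $labels.job }} has been down for more than 1 minute."
```
)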

The alert manager is configured like:

group_by: ['alertname', 'instance', 'consul_service']
group_wait: 30s
group_interval: 1m
repeat_interval: 3m
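(For context, these settings live under the top-level route block of alertmanager.yml; a minimal sketch — the receiver name here is an assumption, not from the original post:

```yaml
route:
  group_by: ['alertname', 'instance', 'consul_service']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 3m
  receiver: ops-team  # hypothetical receiver name

receivers:
  - name: ops-team
    # e.g. email_configs, webhook_configs, ...
```
)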

And so if I bring down one of the services on the remote hosts, the
alerts get triggered and go into the FIRING state. Repetitions happen
every 4 minutes, which I still don't understand (is that because of
group_interval + repeat_interval?), but I can live with that for the moment.

What's confusing me is that when I fix the node_exporter, prometheus
seems to start firing away at the alert manager saying that the alert
has been fixed, and this lasts for exactly 15 minutes, during which you can see:

DEBU[1779] Received alert alert=InstanceDownTest[7cbd568][resolved]
component=dispatcher source=dispatch.go:168
DEBU[1784] Received alert alert=InstanceDownTest[7cbd568][resolved]
component=dispatcher source=dispatch.go:168

meaning that every 5 seconds prometheus is sending an alert with
status=resolved to the alert manager, for the next 15 minutes. During
that time, the alert manager generates 4 notifications: at 0 (since
fixing), at 4 minutes, at 8, and at 12 minutes. Prometheus keeps sending
data until the 15 minutes are up, but the alert manager ignores it.

In the UI, in the /alerts section, all the alerts remain green, but
prometheus nevertheless keeps talking to the alert manager.

Question: is this expected? If so, can this be configured? (or am I
missing something here?)

Many thanks.
--
Jesús Roncero

Jesús Roncero

Feb 10, 2017, 1:06:56 PM
to promethe...@googlegroups.com
Right, I think I found it myself:

// resolvedRetention is the duration for which a resolved alert instance
// is kept in memory state and consequently repeatedly sent to the
// AlertManager.
const resolvedRetention = 15 * time.Minute

from rules/alerting.go

Never mind :).

Thanks.