Jesús Roncero
Feb 10, 2017, 12:43:04 PM
to Prometheus Users
Hi,
I'm testing a Prometheus installation connected to Consul for service
discovery, and I've run into behaviour that confuses me and that I
would like to understand.
I have defined a very simple alert that checks that the node_exporter is
running on the hosts that I get from consul.
The alert is:
ALERT InstanceDownTest
  IF up{job="worker"} == 0
  FOR 1m
  LABELS {
    severity="critical"
  }
  ANNOTATIONS {
    summary = "Test: Instance {{$labels.consul_service_address}} is down",
    description = "Test: {{$labels.consul_service_address}} of job {{$labels.job}} has been down for more than 1 minute.",
  }
The Alertmanager is configured like this:
group_by: ['alertname', 'instance', 'consul_service']
group_wait: 30s
group_interval: 1m
repeat_interval: 3m
If I bring down one of the services on the remote hosts, the alert goes
into the FIRING state as expected. Repeat notifications arrive every
4 minutes, which I still don't understand (is that because of
group_interval + repeat_interval?), but I can live with that for the moment.
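To sanity-check the group_interval + repeat_interval guess, here is a toy model in Python. It is not Alertmanager's code. It rests on assumptions of mine that I have not confirmed: that a group is only re-examined on each group_interval tick, that a repeat goes out only when strictly more than repeat_interval has passed since the last notification, and that the last-notification timestamp is recorded slightly after the tick (the send_latency parameter is hypothetical).

```python
def notification_times(group_interval, repeat_interval, horizon, send_latency=0.1):
    """Toy model: return the times (in seconds) at which a notification is sent."""
    sent = [0.0]                 # first notification, taken as t = 0
    last_logged = send_latency   # assumed: timestamp recorded just after sending
    tick = group_interval
    while tick <= horizon:
        # Assumed rule: repeat only if strictly more than repeat_interval elapsed.
        if tick - last_logged > repeat_interval:
            sent.append(float(tick))
            last_logged = tick + send_latency
        tick += group_interval
    return sent


# Values from my config: group_interval 1m, repeat_interval 3m, over 15 minutes.
times = notification_times(group_interval=60, repeat_interval=180, horizon=15 * 60)
print([t / 60 for t in times])  # [0.0, 4.0, 8.0, 12.0]
```

Under those assumptions the 3-minute tick falls just short of repeat_interval, so the repeat slips to the next tick and the spacing comes out at 4 minutes. Whether Alertmanager really behaves this way is exactly what I'd like confirmed.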
What's confusing me is that when I fix the node_exporter, Prometheus
starts repeatedly telling the Alertmanager that the alert has been
resolved, and this lasts for exactly 15 minutes. During that time the
Alertmanager log shows:
DEBU[1779] Received alert alert=InstanceDownTest[7cbd568][resolved] component=dispatcher source=dispatch.go:168
DEBU[1784] Received alert alert=InstanceDownTest[7cbd568][resolved] component=dispatcher source=dispatch.go:168
meaning that every 5 seconds Prometheus sends the alert with
status=resolved to the Alertmanager, for the next 15 minutes. During
that time, the Alertmanager sends 4 notifications: at 0 (the moment of
the fix), at 4 minutes, at 8 and at 12 minutes. Prometheus keeps sending
to the Alertmanager until the 15-minute mark, but the Alertmanager
ignores it.
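For scale, this is the arithmetic behind those numbers as a short Python snippet. The 5-second spacing is read off the two log lines above (1779 and 1784); the variable names are only illustrative.

```python
send_every_s = 5                   # spacing between the two log timestamps
window_s = 15 * 60                 # how long the resolved messages keep arriving
notification_minutes = [0, 4, 8, 12]  # when the Alertmanager actually notified

resolved_messages = window_s // send_every_s
print(resolved_messages)           # 180
print(len(notification_minutes))   # 4
```

So roughly 180 resolved messages reach the Alertmanager for a single alert, of which only 4 turn into notifications.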
In the Prometheus UI, on the /alerts page, all the alerts stay green,
but Prometheus nevertheless keeps sending to the Alertmanager.
Question: is this expected? If so, can it be configured? (Or am I
missing something here?)
Many thanks.
--
Jesús Roncero