Alert keeps firing for ~2 minutes after it resolved


m.n...@gmail.com

Nov 5, 2018, 5:38:52 AM
to Prometheus Users
Hi

I'm getting resolved alerts with a delay of ~2 minutes.
I run Prometheus and Alertmanager with
 --log.level=debug
I noticed that although the alert is resolved (in the Prometheus dashboard), Alertmanager keeps receiving the alert as firing, so it keeps sending firing notifications.
e.g., logs from Alertmanager:
level=debug ts=2018-11-05T09:11:22.971655651Z caller=dispatch.go:445 component=dispatcher aggrGroup={}:{} msg=Flushing alerts=[highCpuUsage[342bfd3][active]]
level=debug ts=2018-11-05T09:11:23.579966911Z caller=dispatch.go:201 component=dispatcher msg="Received alert" alert=highCpuUsage[342bfd3][active]
level=debug ts=2018-11-05T09:11:23.971970436Z caller=dispatch.go:445 component=dispatcher aggrGroup={}:{} msg=Flushing alerts=[highCpuUsage[342bfd3][active]]

Maybe it is a configuration issue (attached below).
I'm using Prometheus version 2.4.3 and Alertmanager version 0.15.2.

prometheus.yml
# my global config
global:
  scrape_interval:     1s # Set the scrape interval to every 1 second. Default is every 1 minute.
  evaluation_interval: 1s # Evaluate rules every 1 second. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['localhost:9093']
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
   - "cpu_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'alertmanager'
    static_configs:
      - targets: ['localhost:9093']

  - job_name: 'cpu'
    static_configs:
    - targets: ['localhost:9177']




cpu_rules.yml - alert
groups:
- name: lab
  rules:
  - alert: highCpuUsage
    expr: rate(libvirt_cpu_stats_user_time_nanosecs[1m])/10000000 > 40
    labels:
      severity: critical
    annotations:
      title: "High cpu usage"
      description: |
        "High cpu usage on instance"




alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['state']
  group_wait: 1s
  group_interval: 1s
  repeat_interval: 1h
  receiver: admin_user
receivers:
- name: 'admin_user'
  webhook_configs:
  - send_resolved: true
    http_config:
      basic_auth:
        username: 'admin'
        password: 'admin'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']


Thanks

Simon Pasquier

Nov 5, 2018, 6:14:55 AM
to Muhamad Najjar, promethe...@googlegroups.com
You've defined 1s as the rule evaluation interval and no "for:"
clause: it might be that your alert is constantly flapping between
inactive and firing.
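For illustration, a minimal sketch of your rule with a short "for:" clause (the 30s value is just an example, pick whatever tolerance fits your case):

groups:
- name: lab
  rules:
  - alert: highCpuUsage
    expr: rate(libvirt_cpu_stats_user_time_nanosecs[1m])/10000000 > 40
    # The condition must hold across 30s of consecutive evaluations before
    # the alert moves from pending to firing, which damps flapping.
    for: 30s
    labels:
      severity: critical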

m.n...@gmail.com

Nov 5, 2018, 8:21:31 AM
to Prometheus Users
No, it is not constantly flapping between inactive and firing.
I removed "for" because I don't want the alert to transition first to pending and then to firing. I need it to fire right away.
So, what is the proper way to set evaluation_interval if I want to get alerts right away?

Simon Pasquier

Nov 5, 2018, 8:26:46 AM
to Muhamad Najjar, promethe...@googlegroups.com
On Mon, Nov 5, 2018 at 2:21 PM <m.n...@gmail.com> wrote:
>
> No, it is not constantly flapping between inactive and firing.

Can you share the AlertManager logs at debug level to confirm? Does
AlertManager receive the alert as resolved?

> I removed "for" because I don't want the alert to transit first to pending and then firing. I need it to fire right away.
> So, what it the proper way to set evaluation_interval it i wan't to get alerts right away?

It depends what you mean by "right away": there will still be a lag
which is 1s with your current setting.

m.n...@gmail.com

Nov 5, 2018, 8:46:34 AM
to Prometheus Users
Yes, Alertmanager receives the alert as resolved, but with a delay.
I know it is delayed because I see in the Prometheus dashboard that the alert is resolved,
but I still see in the logs that it is firing.
logs are here: Alertmanager logs

1 sec of lag is still OK.
What I mean by "right away" is that whenever the condition matches, I need to get an alert.

Simon Pasquier

Nov 5, 2018, 11:22:34 AM
to Muhamad Najjar, promethe...@googlegroups.com
Hmm indeed, something is broken since Prometheus is sending the EndsAt
field [1] for firing alerts.
On the AlertManager side, it tries to merge identical (label-wise)
alerts with overlapping StartsAt/EndsAt. The start time is always the
min value from both alerts while the end time is the max value from
both alerts. This doesn't work well when Prometheus sends EndsAt
because the resolved alert's end time is usually less than the value
of the firing alert...

[1] https://github.com/prometheus/prometheus/pull/4550
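
A rough Go sketch of that merge logic (simplified; this is not the actual AlertManager code, and the Alert struct and timestamps here are only illustrative):

package main

import (
    "fmt"
    "time"
)

// Alert is a simplified stand-in for AlertManager's alert model;
// the real one also carries labels, annotations, etc.
type Alert struct {
    StartsAt time.Time
    EndsAt   time.Time
}

// merge combines two alerts with identical labels: the earliest start
// time wins and the latest end time wins.
func merge(a, b Alert) Alert {
    m := a
    if b.StartsAt.Before(m.StartsAt) {
        m.StartsAt = b.StartsAt
    }
    if b.EndsAt.After(m.EndsAt) {
        m.EndsAt = b.EndsAt
    }
    return m
}

func main() {
    t := time.Date(2018, 11, 5, 9, 10, 0, 0, time.UTC)
    // Firing alert: Prometheus already set EndsAt a few minutes ahead.
    firing := Alert{StartsAt: t, EndsAt: t.Add(4 * time.Minute)}
    // Resolved alert: its EndsAt is "now", earlier than the firing one's.
    resolved := Alert{StartsAt: t, EndsAt: t.Add(90 * time.Second)}
    merged := merge(firing, resolved)
    // The merged alert keeps the later EndsAt, so it stays firing
    // until the original EndsAt passes.
    fmt.Println(merged.EndsAt) // 2018-11-05 09:14:00 +0000 UTC
}

So a resolved update whose EndsAt is earlier than the EndsAt already recorded on the firing alert cannot pull the end time back; the alert only expires once the firing alert's EndsAt has passed, which could explain the ~2 minute delay you see.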

m.n...@gmail.com

Nov 5, 2018, 12:02:48 PM
to Prometheus Users
I'm not sure I understand what you are trying to say.
Could you explain more?
How can I solve the delay?

Simon Pasquier

Nov 5, 2018, 12:11:29 PM
to Muhamad Najjar, promethe...@googlegroups.com
On Mon, Nov 5, 2018 at 6:02 PM <m.n...@gmail.com> wrote:
>
> I'm not sure I understand what you are trying to say.
> Could you explain more ?
> How can I solve the delay ?

I still need some confirmation, but it looks like a bug in the interaction between
Prometheus and AlertManager.

m.n...@gmail.com

Nov 6, 2018, 3:59:54 AM
to Prometheus Users
I tried running Prometheus 2.3.2 with the same configuration and there was no delay in sending resolved alerts!