Alerts are firing every minute


Amol Nagotkar

Feb 14, 2025, 10:53:24 AM
to Prometheus Users
Hi all,
I want the same alert (alert rule) to fire only after 5 minutes; currently the same alert rule fires every minute with the same '{{ $value }}'.
If the threshold is crossed and the value changes, it fires multiple alerts for the same alert rule; that's fine. But with the same '{{ $value }}', it should only fire again after 5 minutes; the same alert rule with the same value should not fire for the next 5 minutes. How can I get this?
Even when the application is not down, it sends alerts every minute. How can I debug this? I am using the expression below: alert: "Instance Down", expr: up == 0
What are for, keep_firing_for and evaluation_interval?
prometheus.yml

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - ip:port

rule_files:
  - "alerts_rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["ip:port"]

alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_wait: 5s
  group_interval: 5m
  repeat_interval: 15m
  receiver: webhook_receiver
receivers:
  - name: webhook_receiver
    webhook_configs:
      - url: 'http://ip:port'
        send_resolved: false

alerts_rules.yml


groups:
- name: instance_alerts
  rules:
  - alert: "Instance Down"
    expr: up == 0
    # for: 30s
    # keep_firing_for: 30s
    labels:
      severity: "Critical"
    annotations:
      summary: "Endpoint {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."

- name: rabbitmq_alerts
  rules:
    - alert: "Consumer down for last 1 min"
      expr: rabbitmq_queue_consumers == 0
      # for: 1m
      # keep_firing_for: 30s
      labels:
        severity: Critical
      annotations:
        summary: "shortify | '{{ $labels.queue }}' has no consumers"
        description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."


    - alert: "Total Messages > 10k in last 1 min"
      expr: rabbitmq_queue_messages > 10000
      # for: 1m
      # keep_firing_for: 30s
      labels:
        severity: Critical
      annotations:
        summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
        description: |
          Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.


Thank you in advance.

Brian Candler

Feb 14, 2025, 1:43:01 PM
to Prometheus Users
> Even when the application is not down, it sends alerts every minute. How can I debug this? I am using the expression below: alert: "Instance Down", expr: up == 0

You need to show the actual alerts, from the Prometheus web interface and/or the notifications, and then describe how these are different from what you expect.

I very much doubt that the expression "up == 0" is firing unless there is at least one target which is not being scraped, and therefore the "up" metric has a value of 0 for a particular timeseries (metric with a given set of labels).

> If the threshold is crossed and the value changes, it fires multiple alerts for the same alert rule; that's fine. But with the same '{{ $value }}', it should only fire again after 5 minutes; the same alert rule with the same value should not fire for the next 5 minutes. How can I get this?

I cannot work out what problem you are trying to describe. As long as you only use '{{ $value }}' in annotations, not labels, then the same alert will just continue firing.
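
For illustration only (a minimal sketch, loosely based on one of your rules; the rule name here is a placeholder):

    # $value used only in an annotation: the alert identity stays the same,
    # so the same alert simply continues firing while the condition holds.
    - alert: QueueBacklog
      expr: rabbitmq_queue_messages > 10000
      annotations:
        summary: "Queue {{ $labels.queue }} has {{ $value }} messages"

    # By contrast, templating $value into a *label* changes the alert's identity
    # every time the value changes, producing a stream of "new" alerts.
    - alert: QueueBacklog
      expr: rabbitmq_queue_messages > 10000
      labels:
        current_value: "{{ $value }}"   # avoid this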

Whether you get repeated *notifications* about that ongoing alert is a different matter. With "repeat_interval: 15m" you should get them every 15 minutes at least. You may get additional notifications if a new alert is added into the same alert group, or one is resolved from the alert group.

> What are for, keep_firing_for and evaluation_interval?

keep_firing_for is debouncing: once the alert condition has gone away, it will continue firing for this period of time. This is so that if the alert condition vanishes briefly but reappears, it doesn't cause the alert to be resolved and then retriggered.

evaluation_interval is how often the alerting expression is evaluated.
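
Roughly, these settings live in different places (the values below are only examples, based on your own files):

# prometheus.yml
global:
  evaluation_interval: 15s   # how often each alerting rule is evaluated

# alerts_rules.yml (per rule)
- alert: "Instance Down"
  expr: up == 0
  for: 30s              # condition must hold this long (consecutive evaluations) before firing
  keep_firing_for: 30s  # once firing, keep firing this long after the condition clears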

Amol Nagotkar

Mar 5, 2025, 2:50:23 AM
to Prometheus Users

Thank you for the reply.


Answers to the above points:

1. I checked; the expression "up == 0" fires only rarely, and all my targets are being scraped.

2. So that I don't get alerts every minute, I have now set evaluation_interval to 5m.

3. I have removed keep_firing_for as it is not suitable for my use case.


Updated:

I am using Prometheus alerting for RabbitMQ. Below is the configuration I am using.


prometheus.yml file

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 5m # Evaluate rules every 5 minutes. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - ip:port

rule_files:
  - "alerts_rules.yml"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["ip:port"]


alerts_rules.yml file

groups:
- name: instance_alerts
  rules:
  - alert: "Instance Down"
    expr: up == 0
    for: 30s
    # keep_firing_for: 30s
    labels:
      severity: "Critical"
    annotations:
      summary: "Endpoint {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."

- name: rabbitmq_alerts
  rules:
    - alert: "Consumer down for last 1 min"
      expr: rabbitmq_queue_consumers == 0
      for: 30s
      # keep_firing_for: 30s
      labels:
        severity: Critical
      annotations:
        summary: "shortify | '{{ $labels.queue }}' has no consumers"
        description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."

    - alert: "Total Messages > 10k in last 1 min"
      expr: rabbitmq_queue_messages > 10000
      for: 30s
      # keep_firing_for: 30s
      labels:
        severity: Critical
      annotations:
        summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
        description: |
          Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.


Even if there is no data in the queue, it sends me alerts. I have kept evaluation_interval: 5m (Prometheus evaluates alert rules every 5 minutes) and for: 30s (the alert should only fire if the condition persists for 30s).

I guess the for: setting is not working for me.

By the way, I am not using Alertmanager (https://github.com/prometheus/alertmanager/releases/latest/download/alertmanager-0.28.0.linux-amd64.tar.gz); I am just using Prometheus (https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz) from https://prometheus.io/download/.

How can I solve this? Thank you in advance.

Brian Candler

Mar 5, 2025, 3:48:34 AM
to Prometheus Users
You still haven't shown an example of the actual alert you're concerned about (for example, the e-mail containing all the labels and the annotations).

alertmanager cannot generate any alert unless Prometheus triggers it. Please go into the PromQL web interface, switch to the "Graph" tab with the default 1 hour time window (or less), and enter the following queries:

up == 0
rabbitmq_queue_consumers == 0
rabbitmq_queue_messages > 10000

Show the graphs.  If they are not blank, then alerts will be generated. 

"for: 30s" has no effect when you have "evaluation_interval: 5m". I suggest you use evaluation_internal: 15s (to match your scrape internal), and then "for: 30s" will have some benefit; it will only send an alert if the alerting condition has been true for two successive cycles.

Amol Nagotkar

Mar 5, 2025, 4:58:20 AM
to Prometheus Users

Thank you for the quick reply.

So, as I told you, I am not using Alertmanager. I am getting alerts based on the config below:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - IP_ADDRESS_OF_EMAIL_APPLICATION:PORT


written in the prometheus.yml file. Below is the alert response (array of objects) I am receiving from Prometheus.


[
  {
    annotations: {
      description: 'Queue QUEUE_NAME in RabbitMQ has total 1.110738e+06 messages\n' +
        'for more than 1 minutes.\n',
      summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"
    },
    endsAt: '2025-02-03T06:33:31.893Z',
    startsAt: '2025-02-03T06:28:31.893Z',
    generatorURL: 'http://helo-container-pr:9091/graph?g0.expr=rabbitmq_queue_messages+%3E+1e%2B06&g0.tab=1',
    labels: {
      alertname: 'Total Messages > 10L in last 1 min',
      instance: 'IP_ADDRESS:15692',
      job: 'rabbitmq-rcs',
      queue: 'QUEUE_NAME',
      severity: 'critical',
      vhost: 'webhook'
    }
  }
]



If I keep evaluation_interval: 15s, it starts triggering every minute.

I want alerts to be triggered after 5 minutes and only if the condition is true.


Brian Candler

Mar 5, 2025, 1:13:02 PM
to Prometheus Users
I notice that your "up == 0" graph shows lots of green which are values where up == 0. These are legitimately generating alerts, in my opinion. If you have set evaluation_interval to 5m, and "for:" to be less than 5m, then a single instance of up == 0 will send an alert, because that's what you asked for.

> I want alerts to be triggered after 5 minutes and only if the condition is true.


Then you want:

evaluation_interval: 15s  # on the rule group, or globally
for: 5m   # on the individual alerting rule(s)

Then an alert will only be sent if alert condition has been present consecutively for the whole 5 minutes (i.e. 20 cycles).
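
Sketched against the file layout from your earlier posts (only these two settings change; everything else stays as you have it):

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# alerts_rules.yml
  - alert: "Total Messages > 10k in last 1 min"
    expr: rabbitmq_queue_messages > 10000
    for: 5m   # 20 consecutive 15s evaluations must match before the alert fires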

Finally: you may find it helpful to include {{ $value }} in an annotation on each alerting rule, so you can tell the value which triggered the alert. I can see you've done this already in one of your alerts:

   - alert: "Total Messages > 10k in last 1 min"
      expr: rabbitmq_queue_messages > 10000
...

      annotations:
        summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."

And this is reflected in the alert:

      description: 'Queue QUEUE_NAME in RabbitMQ has total 1.110738e+06 messages\n' +
        'for more than 1 minutes.\n',
      summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"


rabbitmq_queue_messages is a vector containing zero or more instances of that metric.

rabbitmq_queue_messages > 10000 is a reduced vector, containing only those instance of the metric with a value greater than 10000.

You can see that the $value at the time the alert was generated was 1.110738e+06, which is 1,110,738, and that's clearly a lot more than 10,000. Hence you get an alert. It's what you asked for.

If you want a more readable string in the annotation, you can use {{ $value | humanize }}, but it will lose some precision.
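
For example (a placeholder annotation, not taken from your rule file):

    summary: "Queue {{ $labels.queue }} has {{ $value | humanize }} messages"   # renders roughly as "1.111M"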

On Wednesday, 5 March 2025 at 10:28:15 UTC Amol Nagotkar wrote:
As you can see in the images below,
the last trigger was at 15:31:29,
and I receive emails after that time as well, for example at 15:35, 15:37, etc.
[Image attachment: IMG-20250305-WA0061.jpg]

[Image attachment: IMG-20250305-WA0060.jpg]

Amol Nagotkar

Mar 6, 2025, 12:23:02 AM
to Prometheus Users
Thanks for the reply. 

1. When I keep evaluation_interval: 5m and for: 30s, I get alerts every 5 minutes (those alerts get stored in Prometheus and trigger every 5 minutes; even when the condition is not matching, I was still getting alerts every 5 minutes).


Now I am changing the config to the below:

evaluation_interval: 15s  # on the rule group, or globally

for: 5m   # on the individual alerting rule(s)

I will update you about this soon.


2. If you want a more readable string in the annotation, you can use {{ $value | humanize }}, but it will lose some precision.

This is a serious concern for us. How do we solve this?


Amol Nagotkar

Mar 6, 2025, 4:04:17 AM
to Prometheus Users

One more important thing:

why do I receive alerts with the same {{ $value }} again and again? In RabbitMQ it is possible to get different values, but getting the exact same value every time is unlikely; yet I receive alerts having the same value many times.

Brian Candler

Mar 6, 2025, 9:32:10 AM
to Prometheus Users
On Thursday, 6 March 2025 at 05:23:02 UTC Amol Nagotkar wrote:

2. If you want a more readable string in the annotation, you can use {{ $value | humanize }}, but it will lose some precision.

This is a serious concern for us. How do we solve this?



How to solve what?

If you want a more readable value in the alert, like "1.111M", use {{ $value | humanize }}

If you want the exact value, use {{ $value }}

Brian Candler

Mar 6, 2025, 9:36:12 AM
to Prometheus Users
Maybe you also have a scraping/data collection problem. I saw all those outages in your "up == 0" graph - that's not good. Fix that first.

The value of a metric is whatever value was last scraped for it, up to a maximum look-back time which defaults to 5 minutes.

You will be able to determine this by entering your alerting expression directly into the PromQL web browser and then switching to "graph" mode. That is, enter

rabbitmq_queue_messages > 10000

into the query box. The graph will show lines wherever alerts would be generated. If the lines are horizontal, then either the same metric value was being returned in each scrape (perhaps it was cached on the rabbitmq side?) or else the scraping failed (look at the corresponding "up" metric for the job which collects that data)
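
For example, using the job label from the alert payload you posted earlier:

rabbitmq_queue_messages > 10000    # where alert instances would be generated
up{job="rabbitmq-rcs"}             # 0 means the scrape of that target failed

If the "up" series drops to 0 around the times the message graph goes flat, the flat line is look-back of stale data rather than fresh scrapes.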

This is stuff that you'll have to sort out on your own system, I'm afraid.

Amol Nagotkar

Mar 6, 2025, 10:00:13 AM
to Prometheus Users

Thank you for your reply. 

With only this config:

scrape_interval: 15s
evaluation_interval: 15s

And

for: 5m  # in rules

you can see in the image below that
there is an event at 18:11:54 having value 117, but I received a total of 13 alerts (emails) having value 117.

[Image attachment: Screenshot from 2025-03-06 20-05-41.png]


Amol Nagotkar

Mar 6, 2025, 10:01:45 AM
to Prometheus Users
[Image attachment: IMG_20250306_202426.jpg]

Brian Candler

Mar 6, 2025, 2:49:01 PM
to Prometheus Users
The alerts you are now showing are for rabbitmq_queue_messages, which is not one of the alerting rules that you showed before, so this problem is a moving target that I can't help you with.

Alerts repeating every minute can be a symptom of labels which are changing (although the graphs don't show that). Or it could be something else that you're not showing, like alertmanager clustering.

Try setting these on your alerting rules.

    for: 5m
    keep_firing_for: 5m
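
That is, roughly (using one of your existing rules as the base):

    - alert: "Total Messages > 10k in last 1 min"
      expr: rabbitmq_queue_messages > 10000
      for: 5m              # condition must hold for 5 minutes before the alert fires
      keep_firing_for: 5m  # and it stays firing for 5 minutes after the condition clears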

I will leave it like that now. It's something odd about how *your* system is configured, which is different from standard configuration. Good luck with your investigations.

On Thursday, 6 March 2025 at 15:01:45 UTC Amol Nagotkar wrote:
[Image attachment: IMG_20250306_202426.jpg]

Amol Nagotkar

Mar 12, 2025, 9:28:24 AM
to Prometheus Users

Thanks for the reply.

Can you please help me with what config I should use?

With this config:


prometheus.yml

scrape_interval: 15s
evaluation_interval: 15s

And

alerts_rules.yml

for: 5m  # in rules


As per the Prometheus graph for my expression:

start time when the expression condition matched: 2025-03-12 15:24:47

end time when the expression condition matched: 2025-03-12 15:31:59


BUT


first alert received at 2025-03-12 15:29:57

last alert received at 2025-03-12 15:47:12
