Alert stuck in Pending state, never fires

Lena

Jun 20, 2023, 4:24:57 AM
to Prometheus Users
Hello,
I hope you can help me with the issue I faced.
I use disk_usage_exporter to get metrics about database sizes. Prometheus scrapes the metrics every 5 minutes. The ServiceMonitor configuration is:
  - interval: 300s
    metricRelabelings:
    - action: replace
      regex: node_disk_usage_bytes
      replacement: database_disk_usage_bytes
      sourceLabels:
      - __name__
      targetLabel: __name__
    path: /metrics
    port: disk-exporter
    relabelings:
    - action: replace
      regex: (.+)-mysql-slave
      replacement: $1
      sourceLabels:
      - service
      targetLabel: cluster
    scrapeTimeout: 120s
Then I have an alert to notify me if a database has a size of more than 15 GB for 24 hours:
    - alert: MySQLDatabaseSize
     expr: database_disk_usage_bytes > 15 * 1024 * 1024 * 1024
     for: 24h
     labels:
       severity: warning
     annotations:
       dashboard: database-disk-usage?var-cluster={{ $labels.cluster }}
        description: MySQL database `{{ $labels.path | reReplaceAll "/var/lib/mysql/" "" }}` takes `{{ $value | humanize }}` of disk space on pod `{{ $labels.pod }}`
       summary: MySQL database has grown too big.

In the testing environment the alert fires properly. However, in the production environment it never fires and stays stuck in the Pending state, and the `Active Since` time is updated every ~5 minutes.
The only difference between the environments is the number of databases in the cluster.
Below are screenshots of the `Active Since` time; you can see that the time changes:
[screenshots: active_since1.png, active_since2.png]
The metric labels are not changing. The graph is stable, so there are no missed metrics or gaps where database size is not defined.
[screenshot: graph.png]

A scrape takes ~20-40 s, which is still within the scrapeTimeout of 120 s.

Rule evaluation takes 1-2 s, with an evaluation_interval of 30 s.

Prometheus version is 2.22.1

I see no related errors in the Prometheus logs and have no clue what the reason for the issue could be.

Thank you for any advice.

Julius Volz

Jun 20, 2023, 6:28:54 AM
to Lena, Prometheus Users
Hi Lena,

One thing I see is that your scrape interval is very long: 300s, which is exactly 5 minutes. The lookback delta of an instant vector selector is also exactly 5 minutes (see https://www.youtube.com/watch?v=xIAEEQwUBXQ&t=272s), which means that the selector will stop returning a result if there is ever a case where there is no datapoint at least 5 minutes prior to the current rule evaluation timestamp. That would reset the "for" duration again. With a 5-minute scrape interval, that can indeed happen to you at times (either just a bit of a delay in scraping or in ingesting scraped samples, or even an occasional failed scrape). I'd recommend setting the interval short enough that you can tolerate an occasional failed scrape (like 2m). Does the problem go away with a shorter interval?
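
For illustration, the change in your ServiceMonitor would only touch the interval. A sketch based on the snippet you posted, with everything else left as-is:

  - interval: 2m
    scrapeTimeout: 120s
    path: /metrics
    port: disk-exporter
    ...

With a scrape every 2 minutes, a single delayed or failed scrape still leaves a sample within the 5-minute lookback window, so the selector keeps returning a result and the "for" timer does not get reset.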

By the way: 24h is quite a long "for" duration. If the series is ever absent for an even longer period during those 24h (like if the exporter is down for a couple of minutes), your alerts will always reset again. An alternative could be to alert on an expression like "min_over_time(database_disk_usage_bytes[24h]) > 15 * 1024 * 1024 * 1024" with a much shorter "for" duration. But some "for" duration is still a good idea, in the case of a fresh Prometheus server that doesn't have 24h of data yet. That way, the alert would become less reliant on perfect scrape / exporter behavior over a full 24h window.
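
As a rough sketch of what I mean (keeping your labels and annotations as they are; the 30m "for" value is just an example, pick whatever tolerance you want):

    - alert: MySQLDatabaseSize
      expr: min_over_time(database_disk_usage_bytes[24h]) > 15 * 1024 * 1024 * 1024
      for: 30m
      labels:
        severity: warning

Since min_over_time() only exceeds the threshold once the smallest value over the last 24 hours is above it, a short gap in the series no longer resets the whole 24-hour wait.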

Regards,
Julius


--
Julius Volz
PromLabs - promlabs.com

Lena

Jun 20, 2023, 7:14:56 AM
to Prometheus Users
Hello Julius,

Thank you very much for your reply.
When I read about the lookback delta previously, I assumed that the graph would also show missing results, i.e. that a missing datapoint would appear as a gap in the graph. Since the graph showed an uninterrupted line, I did not consider checking for this. I see now that I could be wrong.
I also had not considered using a "min_over_time" expression before, but it looks useful.

I will definitely try the suggested changes.
Thank you again.
Have a great day.

Julius Volz

Jun 27, 2023, 11:38:12 AM
to Lena, Prometheus Users
On Tue, Jun 20, 2023 at 1:15 PM Lena <moho...@gmail.com> wrote:
Hello Julius,

Thank you very much for your reply.
When I read about the lookback delta previously, I assumed that the graph would also show missing results, i.e. that a missing datapoint would appear as a gap in the graph. Since the graph showed an uninterrupted line, I did not consider checking for this. I see now that I could be wrong.

This is possible, since the visibility of gaps would depend on the exact alignment of the evaluation timestamp of the rule (or the evaluation step in the graph) relative to the latest sample before that.
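
If you want to check this on your side, one query you could graph (just a suggestion, using the renamed metric from your config) is the age of the most recent sample at each evaluation step:

  time() - timestamp(database_disk_usage_bytes)

Whenever that value gets close to 300 seconds, the latest sample is about to fall outside the 5-minute lookback window at that timestamp, even though the normal graph of the metric can still look like an uninterrupted line.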
 