Guidance on Prometheus Alerting for Shutdown Instances


Tim B.

Dec 6, 2023, 5:05:54 AM
to Prometheus Users
Hello everyone,

I'm relatively new to Prometheus, so your patience is much appreciated.

I'm facing an issue and seeking guidance:

I'm working with a metric like CPU usage, where instance identifiers are submitted as labels. To ensure instances are running as expected, I've defined an alert based on this metric. The alert triggers when the aggregation value (in my case, the increase) over a time window falls below an expected threshold. By utilizing the instance identifier as a label, I've streamlined the alert definition to one.

So far, I've been successful in achieving this. However, I'm grappling with how to handle instances that have been intentionally shut down. Since the metric value for these instances remains static, the alert consistently fires.

How can I address this challenge? Did I make a fundamentally flawed modeling decision? Any insights would be greatly appreciated.

Brian Candler

Dec 6, 2023, 8:28:55 AM
to Prometheus Users
Is there a metric from which you can determine whether a particular instance has been "intentionally shut down"? If so, you can use a join between the metrics in your PromQL alert expression, e.g.:

    expr:  increase(foo[5m]) < 1 unless on (instance) adminShutdown == 1

(Aside: this is not a boolean expression. or/and/unless are set operators: union, intersection and difference respectively. The LHS is a vector of potential alerts; the RHS is also a vector, filtered down to only those timeseries whose value is 1; the "unless" operator suppresses all vector elements in the LHS where there's a matching set of labels on the RHS, in this case considering only the "instance" label because that's what the "on" clause specifies.)
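
For completeness, here's roughly how that expression might sit in a full alerting rule file - the group and alert names, "for" duration and annotation below are just placeholders, not anything specific to your setup:

    groups:
      - name: instance-activity        # example group name
        rules:
          - alert: InstanceInactive    # example alert name
            # Fire when an instance shows too little activity, unless it
            # has been flagged as intentionally shut down.
            expr: increase(foo[5m]) < 1 unless on (instance) adminShutdown == 1
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: 'No recent activity on {{ $labels.instance }}'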

If you don't already have a metric you can use for this, then maybe you need to create one. This could be done on each target - for example, using the node_exporter textfile collector, drop a file like this into the textfile collector directory:

adminShutdown 0

Scraping will add the 'instance' label for you.
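
A slightly fuller sketch of that file, with a made-up filename (the directory is whatever node_exporter's --collector.textfile.directory flag points at; the HELP/TYPE comments are optional):

    # e.g. /var/lib/node_exporter/textfile/admin_shutdown.prom
    # HELP adminShutdown 1 if this instance has been intentionally shut down
    # TYPE adminShutdown gauge
    adminShutdown 0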

Or globally - e.g. create a list of metrics describing the state of each instance, put it on an HTTP server, and scrape it in its own scrape job with "honor_labels: true" to prevent the instance labels from being overridden.

adminShutdown{instance="foo"} 0
adminShutdown{instance="bar"} 1
adminShutdown{instance="baz"} 0

Don't worry about the few extra timeseries this will create. Prometheus compresses timeseries extremely well, especially where scrapes give repeated identical values.

Chris Siebenmann

Dec 6, 2023, 9:46:03 AM
to Tim B., Prometheus Users, Chris Siebenmann
> I'm working with a metric like CPU usage, where instance identifiers
> are submitted as labels. To ensure instances are running as expected,
> I've defined an alert based on this metric. The alert triggers when
> the aggregation value (in my case, the increase) over a time window
> falls below an expected threshold. By utilizing the instance
> identifier as a label, I've streamlined the alert definition to one.
>
> So far, I've been successful in achieving this. However, I'm grappling
> with how to handle instances that have been intentionally shut down.
> Since the metric value for these instances remains static, the alert
> consistently fires.

I think it may depend on how you're collecting these metrics. In
general, the best way to collect per-instance metrics is to have
Prometheus directly scrape them from a target that will stop responding
or go away if the instance does. When a scrape fails, Prometheus
immediately marks all metrics it supplies as stale, and I believe that
this also happens when a scrape target is removed (for example when
service discovery no longer lists it). Metrics that are known to be
stale no longer show up in rate() and similar functions, so normally
they won't trigger such alerts (and any active alert for such a target
will go away when the target is removed).

If you're collecting these metrics in a way that makes them stuck after
the instance goes away (the classical case is publishing them through
Pushgateway), then either you need an additional 'is this instance
alive' check in your alerts or you need some additional system to delete
metrics from now-removed instances from wherever they're getting
published. If you have control over the complete metrics you're
publishing, one option is to publish a last-updated metric and then only
alert if the last-updated metric is recent enough. In many cases you can
arrange for this metric to have the same labels as your other metrics,
so you can just add something like 'and ((time() - metric) < 120)' to
your alert rule. If the labels are different, you'll need to get more
creative.
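
A minimal sketch of that, with made-up metric names and assuming both
metrics carry identical label sets:

  increase(some_counter_total[5m]) < 1
    and (time() - some_last_updated_seconds) < 120

If the two metrics only share, say, the 'instance' label, you'd write
'and on (instance) (...)' instead.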

(Conveniently Pushgateway already provides such a metric for each group,
in 'push_time_seconds'. However, it may not have all the same labels as the
metric you're alerting on.)
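
(For the Pushgateway case that might look roughly like

  increase(some_counter_total[5m]) < 1
    and on (job, instance) (time() - push_time_seconds) < 120

assuming the push grouping key includes 'instance' and you scrape the
Pushgateway with honor_labels set, so push_time_seconds keeps the
pushed 'job' and 'instance' labels; the first metric name is again
just a placeholder.)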

- cks