Alert for stopped containers


Tamar

Mar 8, 2021, 9:39:31 AM
to Prometheus Users
Hi,

I am trying to create an alert for stopped containers.

If I use the exact container name, I have no problem:

  - alert: ContainerKilled
    expr: absent(container_start_time_seconds{name="be-dev-4"})
    for: 15m
    labels:
      severity: 'warning'
    annotations:
      summary: 'Container killed'
      description: 'A container {{ $labels.name }} has disappeared'

However, if I try to use a regexp for the container name (I have a few containers with this suffix), it fails whatever I try.
If I use this, then no alert is sent:
  - alert: ContainerKilled
    expr:  absent(container_start_time_seconds{ name=~".*dev-4"})
    for: 15m
    labels:
      severity: 'warning'
    annotations:
      summary: 'Container killed'
      description: 'A container {{ $labels.name }} has disappeared'

If I use this, then an alert is sent, but without the name of the stopped container:
  - alert: ContainerKilled2
    expr:  absent(container_start_time_seconds{name=~".*dev-4"})
    for: 15m
    labels:
      severity: 'warning'
    annotations:
      summary: 'Container killed'
      description: 'A container has disappeared {{ $labels.instance }} of job {{ $labels.job }}'

Any idea how to alert using a regexp and still get the container name in the alert?

Thanks

M S

Mar 9, 2021, 10:22:31 AM
to Prometheus Users
How in PromQL can I ignore entries with values less than some amount?

Charls P John

Mar 9, 2021, 11:12:47 AM
to M S, Prometheus Users
Hi ms,

I'm sorry, but by entries, did you mean time series? If so, simply write the query as:
metric_name{...} < value

On Tue, Mar 9, 2021, 20:52 'M S' via Prometheus Users <promethe...@googlegroups.com> wrote:
How in PromQL can I ignore entries with values less than some amount?

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/05857BE7-3C3E-4F83-99BF-C7346AE34122%40yahoo.com.

Charls P John

Mar 9, 2021, 11:14:06 AM
to M S, Prometheus Users
Sorry, the other way around:
metric_name{...} > value
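
As a concrete illustration of filtering with a comparison operator, here is a minimal sketch. The metric container_memory_usage_bytes and the 100 MiB threshold are assumptions chosen for the example, not something taken from the question.

```promql
# Keep only the series whose current value is above 100 MiB;
# series at or below the threshold are dropped from the result.
container_memory_usage_bytes{name=~".*dev-4"} > 100 * 1024 * 1024
```

Without the bool modifier the comparison acts as a filter. Writing "> bool" instead would keep every series and return 1 or 0 as the value.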

Matthias Rampke

Mar 11, 2021, 1:59:59 AM
to Tamar, Prometheus Users
The fundamental problem is how Prometheus can know which containers should be there. Considering your regex, there is an infinite number of containers that are "absent": 0dev-4, 1dev-4, … 9999dev-4, …fjdhrhfksnhdev-4 etc.

To solve this, you need a list of concretely expected containers somewhere. That could be separate alerts if the number is small, or some metric that is there even when the container is stopped. In that case you can use the unless operator:

all_expected_containers unless on(name) container_start_time_seconds
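
Wrapped into an alerting rule, that expression might look like the sketch below. It assumes a metric named all_expected_containers exists with one series per expected container and a name label that matches the one on container_start_time_seconds. The group name is made up for the example.

```yaml
groups:
  - name: container_alerts   # hypothetical group name
    rules:
      - alert: ContainerKilled
        # Left side: every container that should exist (assumed metric).
        # "unless on(name)" drops those that still report a start time,
        # so only the missing containers remain, each with its name label.
        expr: all_expected_containers{name=~".*dev-4"} unless on(name) container_start_time_seconds
        for: 15m
        labels:
          severity: 'warning'
        annotations:
          summary: 'Container killed'
          description: 'Container {{ $labels.name }} has disappeared'
```

Because the result keeps the labels of the left-hand side, {{ $labels.name }} is populated here, unlike with absent() and a regex matcher.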

If there is not already such a metric, you could generate it using recording rules (again requires listing them out but is less verbose), write a small exporter that gets the data from your source of truth, or use

container_start_time_seconds offset 15m

to look for containers that have been running before and now are not. The downside of this is that it is noisy when a container is expected to go away, and these alerts "resolve" after 15m whether the container is back up or not.
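
The two rule-based options can be sketched in one rule file, under some assumptions: the container name fe-dev-4, the group names, and the alert name are hypothetical, and the "for" duration is only a guess at a sensible value. It has to be shorter than the offset, since the expression stops matching roughly 15 minutes after the container goes away.

```yaml
groups:
  - name: expected_containers   # hypothetical group name
    rules:
      # Option 1: one recording rule per expected container. vector(1)
      # yields a single sample without labels, and the rule attaches the
      # name label, giving all_expected_containers{name="..."} = 1.
      - record: all_expected_containers
        expr: vector(1)
        labels:
          name: be-dev-4
      - record: all_expected_containers
        expr: vector(1)
        labels:
          name: fe-dev-4   # hypothetical second container
  - name: container_alerts      # hypothetical group name
    rules:
      # Option 2: no list at all. Compare the containers seen 15 minutes
      # ago with the ones seen now and keep those that have vanished.
      - alert: ContainerDisappeared
        expr: |
          container_start_time_seconds{name=~".*dev-4"} offset 15m
            unless on(name)
          container_start_time_seconds{name=~".*dev-4"}
        for: 5m   # assumed value, must be shorter than the 15m offset
        labels:
          severity: 'warning'
        annotations:
          summary: 'Container killed'
          description: 'Container {{ $labels.name }} has disappeared'
```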

/MR

