hundreds of containers, how to alert when a certain container is down?

Sleep Man

May 18, 2024, 2:50:48 PM

to Prometheus Users

I have a large number of containers. I learned that the following configuration can alert when a single container goes down. How can I configure it to monitor all containers and include the container name in the alert when one of them goes down?

- name: containers
  rules:
    - alert: jenkins_down
      expr: absent(container_memory_usage_bytes{name="jenkins"})
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "Jenkins down"
        description: "Jenkins container is down for more than 30 seconds."

Brian Candler

May 18, 2024, 5:01:49 PM

to Prometheus Users

Monitoring for a metric vanishing is not a very good way to do alerting. Metrics hang around for the "staleness" interval, which by default is 5 minutes. Ideally, you should monitor all the things you care about explicitly, get a success metric like "up" (1 = working, 0 = not working) and then alert on "up == 0" or equivalent. This is much more flexible and timely.
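As a rough illustration of the "alert on up == 0" approach, a rule could look like the sketch below. The group name, alert name, and the 1m hold time are placeholders chosen for this example, not something from the thread. Note that "up" is recorded per scrape target, so this only covers things Prometheus scrapes directly; it would not by itself report individual containers behind a single cAdvisor target.

```yaml
groups:
  - name: targets              # hypothetical group name
    rules:
      - alert: TargetDown      # hypothetical alert name
        # "up" is 1 when the last scrape of a target succeeded, 0 when it failed
        expr: up == 0
        for: 1m                # assumed hold time; tune to your scrape interval
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
```

Because the expression returns one element per failing target, each alert carries its own job and instance labels.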

Having said that, there's a quick and dirty hack that might be good enough for you:

expr: container_memory_usage_bytes offset 10m unless container_memory_usage_bytes

This will give you an alert if any container_memory_usage_bytes series existed 10 minutes ago but does not exist now. The alert will resolve itself after 10 minutes.

The result of this expression is a vector, so it can alert on multiple containers at once; each element of the vector will carry the container name in its "name" label.
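Put together with the rule from the original question, the expression might be used as in the following sketch. The alert name is a placeholder, and the annotations assume the container name is in the "name" label as described; `{{ $labels.name }}` is standard Prometheus alert templating.

```yaml
groups:
  - name: containers
    rules:
      - alert: ContainerDown   # hypothetical alert name
        # Series that existed 10 minutes ago but have no matching series now
        expr: container_memory_usage_bytes offset 10m unless container_memory_usage_bytes
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} down"
          description: "Container {{ $labels.name }} reported metrics 10 minutes ago but is not reporting them now."
```

Since "unless" matches on the full label set, a container that is recreated with different labels (for example a new id) would probably still fire for the old series until the 10 minute window passes. If the metric also includes series with an empty "name" label, adding a matcher such as {name!=""} on both sides may be worth considering.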
