Hundreds of containers: how to alert when a certain container is down?


Sleep Man

May 18, 2024, 2:50:48 PM
to Prometheus Users
I have a large number of containers. I learned that the following configuration can alert when a single container is down. How can I configure it to monitor all containers and include the container name in the alert when one of them goes down?


- name: containers
  rules:
  - alert: jenkins_down
    expr: absent(container_memory_usage_bytes{name="jenkins"})
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Jenkins down"
      description: "Jenkins container is down for more than 30 seconds."

Brian Candler

May 18, 2024, 5:01:49 PM
to Prometheus Users
Monitoring for a metric vanishing is not a very good way to do alerting. Metrics hang around for the "staleness" interval, which by default is 5 minutes. Ideally, you should monitor all the things you care about explicitly, get a success metric like "up" (1 = working, 0 = not working) and then alert on "up == 0" or equivalent. This is much more flexible and timely.
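
As a rough sketch of the "up == 0" approach, a rule along the following lines could be used. The group name, alert name, and the 1m hold time are illustrative assumptions. Note that "up" exists once per scrape target, so this only gives per-container alerts if each container is scraped (or probed) as its own target.

```yaml
- name: targets            # hypothetical group name
  rules:
  - alert: target_down     # hypothetical alert name
    # "up" is set by Prometheus for every scrape: 1 = scrape succeeded, 0 = scrape failed
    expr: up == 0
    for: 1m                # assumed hold time; tune to your scrape interval
    labels:
      severity: critical
    annotations:
      summary: "Target {{ $labels.instance }} down"
      description: "Scrapes of {{ $labels.instance }} (job {{ $labels.job }}) have been failing for more than 1 minute."
```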

Having said that, there's a quick and dirty hack that might be good enough for you:

    expr: container_memory_usage_bytes offset 10m unless container_memory_usage_bytes

This will fire an alert for any container_memory_usage_bytes series that existed 10 minutes ago but does not exist now. The alert will resolve itself after 10 minutes, once the offset side of the expression no longer sees the series either.

The result of this expression is a vector, so it can alert on multiple containers at once; each element of the vector will carry the container name in its "name" label.
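
Put together with the rule format from the original question, this might look roughly like the sketch below. The alert name and annotation wording are illustrative assumptions; the `{{ $labels.name }}` template is what puts the container name into the notification.

```yaml
- name: containers
  rules:
  - alert: container_down   # hypothetical alert name
    # One alert per series that existed 10 minutes ago but is missing now
    expr: container_memory_usage_bytes offset 10m unless container_memory_usage_bytes
    labels:
      severity: critical
    annotations:
      summary: "Container {{ $labels.name }} down"
      description: "Container {{ $labels.name }} reported metrics 10 minutes ago but reports none now."
```

Because "unless" matches on the full label set, a container that is recreated with a different label value (for example a new "id") may also trigger this alert. If that turns out to be a problem, matching on the name alone with "unless on (name)" may be worth trying, assuming container names are unique across your hosts.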