Is it possible to extract labels when generating an AlertManager alert?

Sébastien Dionne

Jun 25, 2020, 2:55:44 PM
to Prometheus Users
I have a few Java applications that I'll deploy in my cluster. I need to know how I can detect if an instance is up or down with Prometheus.

Alerting with AlertManager

I have an alert that checks for "InstanceDown" and sends an alert to an AlertManager webhook. So when one instance is down, I'm receiving alerts in my application.

But how can I extract the labels that are in that instance?

For example: I have a special label on all my applications that links the pod to the information that I have in the database:

releaseUUIDGroup=bf79b8ab-a7c1-4d27-8f3c-6e0f0a089c70


Is there a way to add that information to the message that AlertManager sends?


For example, I kill the pod prometheus-pushgateway

and I received this message:

{
  "receiver": "default-receiver",
  "status": "resolved",
  "alerts": [
    {
      "status": "resolved",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "prometheus-pushgateway.default.svc:9091",
        "job": "prometheus-pushgateway",
        "severity": "page"
      },
      "annotations": {
        "description": "prometheus-pushgateway.default.svc:9091 of job prometheus-pushgateway has been down for more than 1 minute.",
        "summary": "Instance prometheus-pushgateway.default.svc:9091 down"
      },
      "startsAt": "2020-06-19T17:09:53.862877577Z",
      "endsAt": "2020-06-22T11:23:53.862877577Z",
      "generatorURL": "http://prometheus-server-57d8dcc67f-qnmkj:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
      "fingerprint": "1ed4a1dca68d64fb"
    }
  ],
  "groupLabels": {},
  "commonLabels": {
    "alertname": "InstanceDown",
    "instance": "prometheus-pushgateway.default.svc:9091",
    "job": "prometheus-pushgateway",
    "severity": "page"
  },
  "commonAnnotations": {
    "description": "prometheus-pushgateway.default.svc:9091 of job prometheus-pushgateway has been down for more than 1 minute.",
    "summary": "Instance prometheus-pushgateway.default.svc:9091 down"
  },
  "externalURL": "http://localhost:9093",
  "version": "4",
  "groupKey": "{}:{}"
}

Christian Hoffmann

Jun 30, 2020, 4:15:58 AM
to Sébastien Dionne, Prometheus Users
Hi,

On 6/25/20 8:55 PM, Sébastien Dionne wrote:
> I have a few Java applications that I'll deploy in my cluster. I need to
> know how I can detect if an instance is up or down with Prometheus.
>
> *Alerting with AlertManager*
> I have an alert that checks for "InstanceDown" and sends an alert to an
> AlertManager webhook. So when one instance is down, I'm receiving alerts
> in my application.
>
> But how can I extract the labels that are in that instance?
What do you mean by "in that instance"?

If the label is part of your service discovery, then it should be
attached to all series from that target. This would also imply that it
would be part of any alert by default unless you aggregate it away (e.g.
by using sum, avg or something).
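
For example, with Kubernetes pod discovery you could copy the pod label
onto the target via relabeling, roughly like this (just a sketch,
assuming a plain kubernetes_sd_configs pod job; the exact relabeling
depends on your setup):

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # copy the releaseUUIDGroup pod label onto every series of the target
      - source_labels: [__meta_kubernetes_pod_label_releaseUUIDGroup]
        target_label: releaseUUIDGroup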

If the label is only part of some info-style metric, you will have to
mix this metric into your alert.
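
Roughly like this (only a sketch, assuming a hypothetical info metric
app_info that carries the releaseUUIDGroup label and shares the
instance/job labels with up):

- alert: InstanceDown
  expr: (up == 0) * on(instance, job) group_left(releaseUUIDGroup) max_over_time(app_info[1h])
  for: 1m
  labels:
    severity: page
  annotations:
    description: '{{ $labels.instance }} (releaseUUIDGroup {{ $labels.releaseUUIDGroup }}) has been down for more than 1 minute.'

The max_over_time() is there because the info metric itself disappears
once the target stops being scraped; looking back over the last hour
keeps the join working for a while after the instance goes down.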

Can you share one of the relevant alert rules if you need more specific
guidance?

Note: I don't know how many releaseUUIDGroups you have, but having UUIDs
as label values might ring some alarm bells due to the potential for
high cardinality issues. :)

Kind regards,
Christian

Sébastien Dionne

Jun 30, 2020, 7:34:00 AM
to Prometheus Users
That is the config I have so far:


serverFiles:
  alerts:
    groups:
      - name: Instances
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 10s
            labels:
              severity: page
            annotations:
              description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
              summary: 'Instance {{ $labels.instance }} down'
              
alertmanagerFiles:
  alertmanager.yml:
    route:
      receiver: default-receiver
      group_wait: 5s
      group_interval: 10s

    receivers:
      - name: default-receiver
        webhook_configs:
              

Here is an example of one of my pods:

          
              pod-template-hash=784669954d
              releaseUUIDGroup=bf79b8ab-a7c1-4d27-8f3c-6e0f0a089c70
              service.ip=10.1.7.200

              prometheus.io/path: /metrics
              prometheus.io/port: 8080
              prometheus.io/scrape: true

I have to get Prometheus to check pod "health" every 10-15 seconds and send an alert for the pods that go from up to down and from down to up.


On the side, I added a Gauge that returns the current timestamp in my application, and I poll Prometheus every 15 seconds to get the last timestamp of all applications. If NOW - timestamp > 15, that means Prometheus wasn't able to call the pod in the last 15 seconds, so I consider that pod down. I do that with a query along those lines.
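
For illustration, such a query could look roughly like this (just a
sketch; the gauge name app_heartbeat_timestamp_seconds is made up):

time() - app_heartbeat_timestamp_seconds > 15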


But if I could do the same directly with Prometheus + AlertManager, I wouldn't have to query Prometheus manually myself.

Sébastien Dionne

Jun 30, 2020, 9:40:46 AM
to Prometheus Users
Yes, when I have labels on my pods, I receive them. Good. I think I'll be able to work with the AlertManager webhook.


Prometheus auto-discovers my pods because they are annotated with:
              prometheus.io/port: 8080
              prometheus.io/scrape: true


But is there a way to configure the scrape interval with an annotation too?

I could have applications that we want to monitor every 15 seconds and others at a 45-second interval or more.



thanks 

Brian Candler

Jul 1, 2020, 3:11:21 AM
to Prometheus Users
On Tuesday, 30 June 2020 14:40:46 UTC+1, Sébastien Dionne wrote:
But is there a way to configure the scrape interval with an annotation too?

I could have applications that we want to monitor every 15 seconds and others at a 45-second interval or more.


You can have two different scrape jobs, one with interval 15s and one with interval 45s.  Use the relabeling step to drop targets which have the wrong annotation for that job.
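
A rough sketch of what that could look like (the prometheus.io/interval annotation name and the job names below are only for illustration, not an existing convention):

scrape_configs:
  - job_name: pods-15s
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # drop targets that explicitly ask for the slower interval
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_interval]
        action: drop
        regex: "45s"
  - job_name: pods-45s
    scrape_interval: 45s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # keep only targets that explicitly ask for the slower interval
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_interval]
        action: keep
        regex: "45s"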