Is it possible to extract labels when generating an AlertManager alert?

Sébastien Dionne

Jun 25, 2020, 2:55:44 PM
to Prometheus Users
I have a few Java applications that I'll deploy in my cluster. I need to know how I can detect if an instance is up or down with Prometheus.

Alerting with AlertManager

I have an alert that checks for "InstanceDown" and sends an alert to an AlertManager webhook. So when one instance is down, I'm receiving alerts in my application.

But how can I extract the labels that are in that instance?

For example: I have a special label on all my applications that links the pod to the information that I have in the database:

releaseUUIDGroup=bf79b8ab-a7c1-4d27-8f3c-6e0f0a089c70


Is there a way to add that information to the message that AlertManager sends?


For example, I kill the pod prometheus-pushgateway

and I received this message:

{
  "receiver": "default-receiver",
  "status": "resolved",
  "alerts": [
    {
      "status": "resolved",
      "labels": {
        "alertname": "InstanceDown",
        "instance": "prometheus-pushgateway.default.svc:9091",
        "job": "prometheus-pushgateway",
        "severity": "page"
      },
      "annotations": {
        "description": "prometheus-pushgateway.default.svc:9091 of job prometheus-pushgateway has been down for more than 1 minute.",
        "summary": "Instance prometheus-pushgateway.default.svc:9091 down"
      },
      "startsAt": "2020-06-19T17:09:53.862877577Z",
      "endsAt": "2020-06-22T11:23:53.862877577Z",
      "generatorURL": "http://prometheus-server-57d8dcc67f-qnmkj:9090/graph?g0.expr=up+%3D%3D+0&g0.tab=1",
      "fingerprint": "1ed4a1dca68d64fb"
    }
  ],
  "groupLabels": {},
  "commonLabels": {
    "alertname": "InstanceDown",
    "instance": "prometheus-pushgateway.default.svc:9091",
    "job": "prometheus-pushgateway",
    "severity": "page"
  },
  "commonAnnotations": {
    "description": "prometheus-pushgateway.default.svc:9091 of job prometheus-pushgateway has been down for more than 1 minute.",
    "summary": "Instance prometheus-pushgateway.default.svc:9091 down"
  },
  "externalURL": "http://localhost:9093",
  "version": "4",
  "groupKey": "{}:{}"
}

Christian Hoffmann

Jun 30, 2020, 4:15:58 AM
to Sébastien Dionne, Prometheus Users
Hi,

On 6/25/20 8:55 PM, Sébastien Dionne wrote:
> I have a few Java applications that I'll deploy in my cluster. I need to
> know how I can detect if an instance is up or down with Prometheus.
>
> *Alerting with AlertManager*
> I have an alert that checks for "InstanceDown" and sends an alert to an
> AlertManager webhook. So when one instance is down, I'm receiving alerts
> in my application.
>
> But how can I extract the labels that are in that instance?
What do you mean by "in that instance"?

If the label is part of your service discovery, then it should be
attached to all series from that target. This would also imply that it
would be part of any alert by default unless you aggregate it away (e.g.
by using sum, avg or something).
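
For example, with Kubernetes pod discovery you could copy the pod label
onto the target via relabeling, roughly like this (just a sketch,
assuming a plain kubernetes_sd_configs pod job; the exact relabeling
depends on your setup):

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # copy the releaseUUIDGroup pod label onto every series of the target
      - source_labels: [__meta_kubernetes_pod_label_releaseUUIDGroup]
        target_label: releaseUUIDGroup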

If the label is only part of some info-style metric, you will have to
mix this metric into your alert.
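
Roughly like this (only a sketch, assuming a hypothetical info metric
app_info that carries the releaseUUIDGroup label and shares the
instance/job labels with up):

- alert: InstanceDown
  expr: (up == 0) * on(instance, job) group_left(releaseUUIDGroup) max_over_time(app_info[1h])
  for: 1m
  labels:
    severity: page
  annotations:
    description: '{{ $labels.instance }} (releaseUUIDGroup {{ $labels.releaseUUIDGroup }}) has been down for more than 1 minute.'

The max_over_time() is there because the info metric itself disappears
once the target stops being scraped; looking back over the last hour
keeps the join working for a while after the instance goes down.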

Can you share one of the relevant alert rules if you need more specific
guidance?

Note: I don't know how many releaseUUIDGroups you have, but having UUIDs
as label values might ring some alarm bells due to the potential for
high cardinality issues. :)

Kind regards,
Christian

Sébastien Dionne

Jun 30, 2020, 7:34:00 AM
to Prometheus Users
That is the config I have so far:


serverFiles:
  alerts:
    groups:
      - name: Instances
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 10s
            labels:
              severity: page
            annotations:
              description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute.'
              summary: 'Instance {{ $labels.instance }} down'
              
alertmanagerFiles:
  alertmanager.yml:
    route:
      receiver: default-receiver
      group_wait: 5s
      group_interval: 10s

    receivers:
      - name: default-receiver
        webhook_configs:
              

Here is an example of one of my pods:

          
              pod-template-hash=784669954d
              releaseUUIDGroup=bf79b8ab-a7c1-4d27-8f3c-6e0f0a089c70
              service.ip=10.1.7.200

              prometheus.io/path: /metrics
              prometheus.io/port: 8080
              prometheus.io/scrape: true

I have to get Prometheus to check pod "health" every 10-15 seconds and send an alert for the pods that go from up to down and from down to up.


On the side, I added a Gauge that returns the current timestamp in my application, and I poll Prometheus every 15 seconds to get the last timestamp of all applications. If NOW - timestamp > 15, that means Prometheus wasn't able to call the pod in the last 15 seconds, so I consider that pod down. I do that with a query along those lines.
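
For illustration, such a query could look roughly like this (just a
sketch; the gauge name app_heartbeat_timestamp_seconds is made up):

time() - app_heartbeat_timestamp_seconds > 15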


But if I could do the same directly with Prometheus + AlertManager, I wouldn't have to query Prometheus manually myself.

Sébastien Dionne

Jun 30, 2020, 9:40:46 AM
to Prometheus Users
Yes, when I have labels on my pods, I receive them. Good. I think I'll be able to work with the AlertManager webhook.


Prometheus auto-discovers my pods because they are annotated with:
              prometheus.io/port: 8080
              prometheus.io/scrape: true


But is there a way to configure the scrape interval with an annotation too?

I could have applications that we want to monitor every 15 seconds and others at a 45-second interval or more.



thanks 

Brian Candler

Jul 1, 2020, 3:11:21 AM
to Prometheus Users
On Tuesday, 30 June 2020 14:40:46 UTC+1, Sébastien Dionne wrote:
But is there a way to configure the scrape interval with an annotation too?

I could have applications that we want to monitor every 15 seconds and others at a 45-second interval or more.


You can have two different scrape jobs, one with interval 15s and one with interval 45s.  Use the relabeling step to drop targets which have the wrong annotation for that job.
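
A rough sketch of what that could look like (the prometheus.io/interval annotation name and the job names below are only for illustration, not an existing convention):

scrape_configs:
  - job_name: pods-15s
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # drop targets that explicitly ask for the slower interval
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_interval]
        action: drop
        regex: "45s"
  - job_name: pods-45s
    scrape_interval: 45s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # keep only targets that explicitly ask for the slower interval
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_interval]
        action: keep
        regex: "45s"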