Prometheus alert tagging issue - multiple servers


mohan garden

Apr 2, 2024, 12:15:39 PM
to Prometheus Users
Dear Prometheus Community,
I am reaching out regarding an issue I have encountered with Prometheus alert tagging, specifically while creating tickets in Opsgenie.


I have configured Alertmanager to send alerts to Opsgenie; the configuration is:
[screenshot: photo001.png]
A ticket is generated with the expected description and tags:
[screenshot: photo002.png]

Now, by default the alerts are grouped by the alert name (default behavior). So when a similar event happens on a different server, I see that the description is updated:
[screenshot: photo003.png]
but the tags on the ticket remain the same.
Expected behavior: criteria=..., host=108, host=114, infra.....support

I have set the update_alerts and send_resolved settings to true.
I am not sure whether I need additional configuration on the Opsgenie side or in Alertmanager to make this work as expected.

I would appreciate any insight or guidance on how to resolve this issue and ensure that alerts from different servers are correctly tagged in Opsgenie.

Thank you in advance.
Regards,
CP

Brian Candler

Apr 2, 2024, 1:16:36 PM
to Prometheus Users
FYI, those images are unreadable - copy-pasted text would be much better.

My guess, though, is that you probably don't want to group alerts before sending them to opsgenie. You haven't shown your full alertmanager config, but if you have a line like

   group_by: ['alertname']

then try

   group_by: ["..."]

(literally, exactly that: a single string containing three dots, inside square brackets)
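For example, in the route tree it would sit like this (the receiver name here is just a placeholder):

   route:
     receiver: opsgenie     # placeholder receiver name
     # "..." disables grouping: every alert is sent as its own notification
     group_by: ["..."]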

mohan garden

Apr 3, 2024, 4:07:12 AM
to Prometheus Users
Hi Brian, 
Thank you for the response. Here are some more details, which I hope will give you a better understanding of the configuration and the method I am using to generate tags:


1. We collect data from node exporter and have created some alerting rules around the collected data. Here is one example:
    - alert: "Local Disk usage has reached 50%"
      expr: (round( node_filesystem_avail_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*",} / node_filesystem_size_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"} * 100  ,0.1) >= y ) and (round( node_filesystem_avail_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"} / node_filesystem_size_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"} * 100  ,0.1) <= z )
      for: 5m
      labels:
        criteria: overuse
        severity: critical
        team: support

      annotations:
        summary: "{{ $labels.instance }} 's  ({{ $labels.device }}) has low space."
        description: "space on {{ $labels.mountpoint }} file system at {{ $labels.instance }} server = {{ $value }}%."


2. In Alertmanager, we have created notification routes to notify when the aforementioned condition occurs:

global:
  smtp_from: 'ser...@example.com'
  smtp_require_tls: false
  smtp_smarthost: 'ser...@example.com:25'

templates:
  - /home/ALERTMANAGER/conf/template/*.tmpl

route:
  group_wait: 5m
  group_interval: 2h
  repeat_interval: 5h
  receiver: admin
  routes:
  - match_re:
      alertname: ".*Local Disk usage has reached .*%"
    receiver: admin
    routes:
    - match:
        criteria: overuse
        severity: critical
        team: support

      receiver: mailsupport
      continue: true
    - match:
        criteria: overuse
        team: support
        severity: critical
      receiver: opsgeniesupport


receivers:
  - name: opsgeniesupport
    opsgenie_configs:
    - api_key: XYZ
      api_url: https://api.opsgenie.com
      message: '{{ .CommonLabels.alertname }}'
      description: "{{ range .Alerts }}{{ .Annotations.description }}\n\r{{ end }}"
      tags: '{{ range $k, $v := .CommonLabels}}{{ if or (eq $k "criteria")  (eq $k "severity") (eq $k "team") }}{{$k}}={{$v}},{{ else if eq $k "instance" }}{{ reReplaceAll "(.+):(.+)" "host=$1" $v }},{{end}}{{end}},infra,monitor'
      priority: 'P1'
      update_alerts: true
      send_resolved: true

...
So you can see that I derive a host=<hostname> tag from the instance label.


Scenario 1: When server1's local disk usage reaches 50%, I see that an Opsgenie ticket is created with:
Opsgenie ticket metadata:
ticket header name: local disk usage reached 50%
ticket description: space on /var file system at server1:9100 server = 82%.
ticket tags: criteria: overuse, team: support, severity: critical, infra, monitor, host=server1

So everything works as expected; no issues with Scenario 1.


Scenario 2: While the server1 alert is still active, a second server's (say server2) local disk usage reaches 50%.

I see that Opsgenie tickets are getting updated as:
ticket header name: local disk usage reached 50%
ticket description: space on /var file system at server1:9100 server = 82%.
ticket description: space on /var file system at server2:9100 server = 80%.
ticket tags: criteria: overuse, team: support, severity: critical, infra, monitor, host=server1


But I was expecting an additional host=server2 tag on the ticket.
In summary: I see an updated description, but not updated tags.

In the tags section of the Alertmanager-Opsgenie integration configuration, I had tried iterating over Alerts and over CommonLabels, but I was unable to add the additional host=server2 tag:
{{ range $idx, $alert := .Alerts}}{{range $k, $v := $alert.Labels }}{{$k}}={{$v}},{{end}}{{end}},test=test
{{ range $k, $v := .CommonLabels}}....{{end}}



At the moment, I am not sure what is preventing the tags on the Opsgenie ticket from being updated.
If I can get some clarity on whether my Alertmanager configuration is good enough, then I can look at the Opsgenie configuration.


Please advise.


Regards
CP


mohan garden

Apr 3, 2024, 4:11:24 AM
to Prometheus Users
*correction: 

Scenario 2: While the server1 alert is still active, a second server's (say server2) local disk usage reaches 50%.

I see that the already-open Opsgenie ticket's details get updated as:

ticket header name: local disk usage reached 50%
ticket description: space on /var file system at server1:9100 server = 82%.
                    space on /var file system at server2:9100 server = 80%.
ticket tags: criteria: overuse, team: support, severity: critical, infra, monitor, host=server1

[screenshot: photo003.png]

Brian Candler

Apr 3, 2024, 8:14:17 AM
to Prometheus Users
> But I was expecting an additional host=server2 tag on the ticket.

You won't get that, because CommonLabels is exactly how it sounds: those labels which are common to all the alerts in the group.  If one alert has instance=server1 and the other has instance=server2, but they're in the same alert group, then no 'instance' will appear in CommonLabels.

The documentation is here:

It looks like you could iterate over Alerts.Firing then the Labels within each alert.
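Untested, but a sketch of what that might look like in the tags field (the trailing static tags just mirror your existing template):

   tags: '{{ range .Alerts.Firing }}{{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }},{{ end }}{{ end }}infra,monitor'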

Alternatively, you could disable grouping and let opsgenie do the grouping (I don't know opsgenie, so I don't know how good a job it would do of that)

mohan garden

Apr 3, 2024, 11:01:21 AM
to Prometheus Users
Thank you for the pointers. I tried - 
tags: '{{ range .Alerts.Firing }} {{ range .Labels.SortedPairs }}  {{ .Name }}={{ .Value }}, {{ end }} {{end}}'

but I did not see any change in the outcome.
I see all the tags (alertname, job, instance, ...), but only from the first alert; the tags from the second alert did not show up.

Is there a way I can see the entire message which Alertmanager sends out to Opsgenie - somewhere in the Alertmanager logs or a text file?
That would help me understand whether Alertmanager is sending all the tags and it is Opsgenie that is dropping the extra ones.

Regards
CP

Puneet Singh

Apr 3, 2024, 11:34:40 AM
to Prometheus Users
UPDATE:
I had a look at https://docs.opsgenie.com/docs/alert-api#add-tags-to-alert.
Using the following API call:

curl -X POST https://api.opsgenie.com/v2/alerts/<alert id>/tags?identifierType=id -H "Content-Type: application/json" -H "Authorization: GenieKey <api key>" -d '{ "tags": ["host=testserver","instance=testserver123"], "user":"Monitoring Script", "note":"Action executed via Alert API" }'

I was able to append additional tags to the existing Opsgenie tickets.
[screenshot: photo004.png]
So I think there is no restriction on Opsgenie's end; the tag update would need to be handled by Alertmanager's Opsgenie integration.
I am not sure how, internally, Alertmanager sends the tags information to the Opsgenie API when new alerts (part of the same alert group) come in.

Brian Candler

Apr 3, 2024, 12:29:06 PM
to Prometheus Users
On Wednesday 3 April 2024 at 16:01:21 UTC+1 mohan garden wrote:
> Is there a way I can see the entire message which Alertmanager sends out to Opsgenie - somewhere in the Alertmanager logs or a text file?

You could try setting api_url to point to a webserver that you control.
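For example (the hostname and port are just placeholders for a listener you run yourself):

   receivers:
     - name: opsgeniesupport
       opsgenie_configs:
         - api_key: XYZ
           # point this at a server you control; Alertmanager will send the same
           # POST and PUT requests here that it would normally send to the Opsgenie API
           api_url: http://debug-host:5000/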

mohan garden

Jul 27, 2024, 11:39:57 AM
to Prometheus Users
Hi Brian,
Thank you for the suggestion.
I was able to set up a Flask application to capture the data Alertmanager sends towards Opsgenie via the api_url endpoint.
I had to create 3 endpoints:
1. POST for - /
2. PUT for /v2/alerts/message
3. PUT for  /v2/alerts/description


POST:
{'alias': '<mangled>71c5c169a773796b467cc741f70457c4', 'message': 'Type1 Server is down or node exporter is unreachable', 'description': 'server1:9100 server is down or prometheus is unable to query the node exporter service which should be up and running.\n\rserver2:9100 server is down or prometheus is unable to query the node exporter service which should be up and running.\n\r', 'details': {'SERVER_CATEGORY': 'Type1', 'SERVER_SITE': 'ind', 'alertname': 'Type1 Server is down or node exporter is unreachable', 'criteria': 'nodedown', 'job': 'default_nodeexporters', 'severity': 'critical', 'team': 'infrasupport'}, 'source': 'http://alertmanager:9093/#/alerts?receiver=opsgenie_support', 'tags': ['SERVER_CATEGORY=Type1', 'SERVER_SITE=ind', 'criteria=nodedown', 'severity=critical', 'team=support', 'support', 'monitor', 'server1:9100', 'server2:9100'], 'priority': 'P1'}
10.73.6.210 - - [27/Jul/2024 07:32:04] "POST /v2/alerts HTTP/1.1" 200 -

First PUT:
{'message': 'Utility Server is down or node exporter is unreachable'}
10.73.6.210 - - [27/Jul/2024 07:32:04] "PUT /v2/alerts/<mangled>71c5c169a773796b467cc741f70457c4/message?identifierType=alias HTTP/1.1" 200 -

Second PUT:
{'description': 'server1:9100 server is down or prometheus is unable to query the node exporter service which should be up and running.\n\rserver2:9100 server is down or prometheus is unable to query the node exporter service which should be up and running.\n\r'}
10.73.6.210 - - [27/Jul/2024 07:32:04] "PUT /v2/alerts/<mangled>71c5c169a773796b467cc741f70457c4/description?identifierType=alias HTTP/1.1" 200 -

It seems Alertmanager would need to send another PUT request to update the Opsgenie tags.

mohan garden

Jul 27, 2024, 11:57:24 AM
to Prometheus Users

I plan to disable grouping only for the Opsgenie routes and only for a specific set of alerts. Here is an example of the current Alertmanager configuration:

route:
  group_wait: 5m
  group_interval: 5m
  repeat_interval: 7h

  receiver: admin
  routes:
  - match_re:
      alertname: ".* Type1 Server is down.* "
    receiver: admingroup2
    routes:
    - match:

        team: support
        severity: critical
      receiver: opsgeniesupport
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 6h
      continue: true
    - match:
        team: support
        severity: critical
      receiver: mailsupport
      group_wait: 1m
      group_interval: 1h
      repeat_interval: 12h

Q1: Is it possible to disable grouping for a specific type of alert (say, alerts with the Type1 keyword) only for the Opsgenie route? I am looking for something like:

    - match:

        team: support
        severity: critical
      receiver: opsgeniesupport
      group_by: [instance]
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 6h
      continue: true
    - match:
        team: support
        severity: critical
      receiver: mailsupport
      group_by: [instance]
      group_wait: 1m
      group_interval: 1h
      repeat_interval: 12h
Is this allowed by Alertmanager?


Q2: Is it possible to change the alert name in Prometheus before it dispatches the alert to Alertmanager?
- alert: "Type1 down or process monitoring service is unreachable"
      expr: up{ SERVER_CATEGORY='Type1'  } == 0
      for: 2m
      labels:

        severity: critical
        team: support
      annotations:
        summary: "{{ $labels.instance }} is not reachable"
        description: "{{ $labels.instance }} is not reachable"

    - alert: " Type1 down or process monitoring service is unreachable   - {{ $labels.instance}} " 

Hopefully this will help, as I am unable to get the appropriate tags in Opsgenie when using grouping.
Having a hostname tag will be helpful, since via the JIRA integration we can then see how many incidents have occurred for a host in the past.

Regards
MG

Brian Candler

Jul 27, 2024, 1:21:56 PM
to Prometheus Users
Q1 - yes, each route can have a separate group_by section, as shown in the documentation:

Note that if you do
group_by: [instance]
then you'll get one Opsgenie alert group for an instance, even if there are multiple problems with that instance. If you want to disable grouping completely, put a string with three dots between the square brackets:
group_by: ['...']
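Applied to your example it would be something like this (a sketch, using the receiver name from your config):

    - match:
        team: support
        severity: critical
      receiver: opsgeniesupport
      # per-route override: '...' disables grouping for this route only
      group_by: ['...']
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 6h
      continue: true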

Q2 - I don't see why you want to put {{ $labels.instance }} in the alert name. It's then no longer the name of the alert, it's a combination of the name of the alert and the name of the instance; and to analyze the data by instance you'd have to parse it out of the alert.

Put it in the alert description instead.
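In other words, keep the alert name fixed and let the annotations carry the instance, as your existing rule already does:

    - alert: "Type1 down or process monitoring service is unreachable"
      expr: up{ SERVER_CATEGORY='Type1'  } == 0
      for: 2m
      labels:
        severity: critical
        team: support
      annotations:
        summary: "{{ $labels.instance }} is not reachable"
        description: "{{ $labels.instance }} is not reachable"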

> Having a hostname tag will be helpful, since via the JIRA integration we can then see how many incidents have occurred for a host in the past.

Surely it would be better to do this analysis with alert labels, and from what I can see of the POST content you showed, Opsgenie calls these "tags" rather than "labels".

> It seems Alertmanager would need to send another PUT request to update the Opsgenie tags.

Are you saying that the problem is that Alertmanager isn't updating the tags? But if these tags come from CommonLabels, and the alerts are part of a group, then the CommonLabels are by definition those which are common to all the alerts in the group.

It seems to me that there are two meaningful alternatives. Either:
1. multiple alerts from Prometheus are in the same group (in which case, it's a single alert as far as Opsgenie is concerned, and the tags are the labels common to all alerts in the group); or
2. you send separate alerts from Prometheus, each with their own tags, and then you analyze and/or group them Opsgenie-side.

If host-by-host incident analysis is what you want, then option (2) seems to be the way to go.

What version of Alertmanager are you running? Looking in the changelogs I don't see any particular recent changes, and I notice you're already using "update_alerts: true", but I thought it was worth checking.

## 0.25.0 / 2022-12-22

* [ENHANCEMENT] Support templating for Opsgenie's responder type. #3060

## 0.24.0 / 2022-03-24

* [ENHANCEMENT] Add `update_alerts` field to the OpsGenie configuration to update message and description when sending alerts. #2519
* [ENHANCEMENT] Add `entity` and `actions` fields to the OpsGenie configuration. #2753
* [ENHANCEMENT] Add `opsgenie_api_key_file` field to the global configuration. #2728
* [ENHANCEMENT] Add support for `teams` responders to the OpsGenie configuration. #2685

## 0.22.0 / 2021-05-21

* [ENHANCEMENT] OpsGenie: Propagate labels to Opsgenie details. #2276