Prometheus alert tagging issue - multiple servers


mohan garden

Apr 2, 2024, 12:15:39 PM
to Prometheus Users
Dear Prometheus Community,
I am reaching out regarding an issue I have encountered with Prometheus alert tagging, specifically while creating tickets in Opsgenie.


I have configured Alertmanager to send alerts to Opsgenie; the configuration is:
[screenshot: photo001.png]
A ticket is generated with the expected description and tags:
[screenshot: photo002.png]

Now, by default the alerts are grouped by the alert name (default behavior). So when a similar event happens on a different server, I see that the description is updated:
[screenshot: photo003.png]
but the tags on the ticket remain the same.
Expected behavior: criteria=..., host=108, host=114, infra.....support

I have set the update_alerts and send_resolved settings to true.
I am not sure whether I need additional configuration on the Opsgenie side or in Alertmanager to make this work as expected.

I would appreciate any insight or guidance on how to resolve this issue and ensure that alerts from different servers are correctly tagged in Opsgenie.

Thank you in advance.
Regards,
CP

Brian Candler

Apr 2, 2024, 1:16:36 PM
to Prometheus Users
FYI, those images are unreadable - copy-pasted text would be much better.

My guess, though, is that you probably don't want to group alerts before sending them to opsgenie. You haven't shown your full alertmanager config, but if you have a line like

   group_by: ['alertname']

then try

   group_by: ["..."]

(literally, exactly that: a single string containing three dots, inside square brackets)
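For example, in the route tree it would sit like this (the receiver name here is just a placeholder):

   route:
     receiver: opsgenie     # placeholder receiver name
     # "..." disables grouping: every alert is sent as its own notification
     group_by: ["..."]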

mohan garden

Apr 3, 2024, 4:07:12 AM
to Prometheus Users
Hi Brian, 
Thank you for the response. Here are some more details, which I hope will give you a better understanding of the configuration and the method I am using to generate tags:


1. We collect data from node exporter and have created some alerting rules around the collected data. Here is one example:
    - alert: "Local Disk usage has reached 50%"
      expr: (round( node_filesystem_avail_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*",} / node_filesystem_size_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"} * 100  ,0.1) >= y ) and (round( node_filesystem_avail_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"} / node_filesystem_size_bytes{mountpoint=~"/dev.*|/sys*|/|/home|/tmp|/var.*|/boot.*"} * 100  ,0.1) <= z )
      for: 5m
      labels:
        criteria: overuse
        severity: critical
        team: support

      annotations:
        summary: "{{ $labels.instance }} 's  ({{ $labels.device }}) has low space."
        description: "space on {{ $labels.mountpoint }} file system at {{ $labels.instance }} server = {{ $value }}%."


2. In Alertmanager, we have created notification routes to notify when the aforementioned condition occurs:

global:
  smtp_from: 'ser...@example.com'
  smtp_require_tls: false
  smtp_smarthost: 'ser...@example.com:25'

templates:
  - /home/ALERTMANAGER/conf/template/*.tmpl

route:
  group_wait: 5m
  group_interval: 2h
  repeat_interval: 5h
  receiver: admin
  routes:
  - match_re:
      alertname: ".*Local Disk usage has reached .*%"
    receiver: admin
    routes:
    - match:
        criteria: overuse
        severity: critical
        team: support

      receiver: mailsupport
      continue: true
    - match:
        criteria: overuse
        team: support
        severity: critical
      receiver: opsgeniesupport


receivers:
  - name: opsgeniesupport
    opsgenie_configs:
    - api_key: XYZ
      api_url: https://api.opsgenie.com
      message: '{{ .CommonLabels.alertname }}'
      description: "{{ range .Alerts }}{{ .Annotations.description }}\n\r{{ end }}"
      tags: '{{ range $k, $v := .CommonLabels}}{{ if or (eq $k "criteria")  (eq $k "severity") (eq $k "team") }}{{$k}}={{$v}},{{ else if eq $k "instance" }}{{ reReplaceAll "(.+):(.+)" "host=$1" $v }},{{end}}{{end}},infra,monitor'
      priority: 'P1'
      update_alerts: true
      send_resolved: true

...
So you can see that I derive a host=<hostname> tag from the instance label.


Scenario 1: When server1's local disk usage reaches 50%, I see that an Opsgenie ticket is created with:
Opsgenie ticket metadata:
ticket header name: local disk usage reached 50%
ticket description: space on /var file system at server1:9100 server = 82%.
ticket tags: criteria: overuse, team: support, severity: critical, infra, monitor, host=server1

So everything works as expected; no issues with Scenario 1.


Scenario 2: While the server1 alert is still active, a second server's (say server2) local disk usage reaches 50%.

I see that Opsgenie tickets are getting updated as:
ticket header name: local disk usage reached 50%
ticket description: space on /var file system at server1:9100 server = 82%.
ticket description: space on /var file system at server2:9100 server = 80%.
ticket tags: criteria: overuse, team: support, severity: critical, infra, monitor, host=server1


But I was expecting an additional host=server2 tag on the ticket.
In summary: I see an updated description, but not updated tags.

In the tags section of the Alertmanager-Opsgenie integration configuration, I had tried iterating over Alerts and over CommonLabels, but I was unable to add the additional host=server2 tag:
{{ range $idx, $alert := .Alerts}}{{range $k, $v := $alert.Labels }}{{$k}}={{$v}},{{end}}{{end}},test=test
{{ range $k, $v := .CommonLabels}}....{{end}}



At the moment, I am not sure what is preventing the tags on the Opsgenie ticket from being updated.
If I can get some clarity on whether my Alertmanager configuration is good enough, then I can look at the Opsgenie configuration.


Please advise.


Regards
CP


mohan garden

Apr 3, 2024, 4:11:24 AM
to Prometheus Users
*correction: 

Scenario 2: While the server1 alert is still active, a second server's (say server2) local disk usage reaches 50%.

I see that the already-open Opsgenie ticket's details get updated as:

ticket header name: local disk usage reached 50%
ticket description: space on /var file system at server1:9100 server = 82%.
                    space on /var file system at server2:9100 server = 80%.
ticket tags: criteria: overuse, team: support, severity: critical, infra, monitor, host=server1

[screenshot: photo003.png]

Brian Candler

Apr 3, 2024, 8:14:17 AM
to Prometheus Users
> But I was expecting an additional host=server2 tag on the ticket.

You won't get that, because CommonLabels is exactly how it sounds: those labels which are common to all the alerts in the group.  If one alert has instance=server1 and the other has instance=server2, but they're in the same alert group, then no 'instance' will appear in CommonLabels.

The documentation is here:

It looks like you could iterate over Alerts.Firing then the Labels within each alert.
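Untested, but a sketch of what that might look like in the tags field (the trailing static tags just mirror your existing template):

   tags: '{{ range .Alerts.Firing }}{{ range .Labels.SortedPairs }}{{ .Name }}={{ .Value }},{{ end }}{{ end }}infra,monitor'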

Alternatively, you could disable grouping and let opsgenie do the grouping (I don't know opsgenie, so I don't know how good a job it would do of that)

mohan garden

Apr 3, 2024, 11:01:21 AM
to Prometheus Users
Thank you for the pointers. I tried - 
tags: '{{ range .Alerts.Firing }} {{ range .Labels.SortedPairs }}  {{ .Name }}={{ .Value }}, {{ end }} {{end}}'

but I did not see any change in the outcome.
I see all the tags (alertname, job, instance, ...), but only from the first alert; the tags from the second alert did not show up.

Is there a way I can see the entire message which Alertmanager sends out to Opsgenie - somewhere in the Alertmanager logs or a text file?
That would help me understand whether Alertmanager is sending all the tags and it is Opsgenie that is dropping the extra ones.

Regards
CP

Puneet Singh

Apr 3, 2024, 11:34:40 AM
to Prometheus Users
UPDATE:
I had a look at https://docs.opsgenie.com/docs/alert-api#add-tags-to-alert.
Using the following API call:

curl -X POST https://api.opsgenie.com/v2/alerts/<alert id>/tags?identifierType=id -H "Content-Type: application/json" -H "Authorization: GenieKey <api key>" -d '{ "tags": ["host=testserver","instance=testserver123"], "user":"Monitoring Script", "note":"Action executed via Alert API" }'

I was able to append additional tags to the existing Opsgenie tickets.
[screenshot: photo004.png]
So I think there is no restriction on Opsgenie's end; the tag update would need to be handled by Alertmanager's Opsgenie integration.
I am not sure how, internally, Alertmanager sends the tags information to the Opsgenie API when new alerts (part of the same alert group) come in.

Brian Candler

Apr 3, 2024, 12:29:06 PM
to Prometheus Users
On Wednesday 3 April 2024 at 16:01:21 UTC+1 mohan garden wrote:
> Is there a way I can see the entire message which Alertmanager sends out to Opsgenie - somewhere in the Alertmanager logs or a text file?

You could try setting api_url to point to a webserver that you control.
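For example (the hostname and port are just placeholders for a listener you run yourself):

   receivers:
     - name: opsgeniesupport
       opsgenie_configs:
         - api_key: XYZ
           # point this at a server you control; Alertmanager will send the same
           # POST and PUT requests here that it would normally send to the Opsgenie API
           api_url: http://debug-host:5000/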

mohan garden

Jul 27, 2024, 11:39:57 AM
to Prometheus Users
Hi Brian,
Thank you for the suggestion.
I was able to set up a Flask application to capture the data Alertmanager sends towards Opsgenie via the api_url endpoint.
I had to create 3 endpoints:
1. POST for - /
2. PUT for /v2/alerts/message
3. PUT for  /v2/alerts/description


POST:
{'alias': '<mangled>71c5c169a773796b467cc741f70457c4', 'message': 'Type1 Server is down or node exporter is unreachable', 'description': 'server1:9100 server is down or prometheus is unable to query the node exporter service which should be up and running.\n\rserver2:9100 server is down or prometheus is unable to query the node exporter service which should be up and running.\n\r', 'details': {'SERVER_CATEGORY': 'Type1', 'SERVER_SITE': 'ind', 'alertname': 'Type1 Server is down or node exporter is unreachable', 'criteria': 'nodedown', 'job': 'default_nodeexporters', 'severity': 'critical', 'team': 'infrasupport'}, 'source': 'http://alertmanager:9093/#/alerts?receiver=opsgenie_support', 'tags': ['SERVER_CATEGORY=Type1', 'SERVER_SITE=ind', 'criteria=nodedown', 'severity=critical', 'team=support', 'support', 'monitor', 'server1:9100', 'server2:9100'], 'priority': 'P1'}
10.73.6.210 - - [27/Jul/2024 07:32:04] "POST /v2/alerts HTTP/1.1" 200 -

First PUT:
{'message': 'Utility Server is down or node exporter is unreachable'}
10.73.6.210 - - [27/Jul/2024 07:32:04] "PUT /v2/alerts/<mangled>71c5c169a773796b467cc741f70457c4/message?identifierType=alias HTTP/1.1" 200 -

Second PUT:
{'description': 'server1:9100 server is down or prometheus is unable to query the node exporter service which should be up and running.\n\rserver2:9100 server is down or prometheus is unable to query the node exporter service which should be up and running.\n\r'}
10.73.6.210 - - [27/Jul/2024 07:32:04] "PUT /v2/alerts/<mangled>71c5c169a773796b467cc741f70457c4/description?identifierType=alias HTTP/1.1" 200 -

It seems Alertmanager would need to send another PUT request to update the Opsgenie tags.

mohan garden

Jul 27, 2024, 11:57:24 AM
to Prometheus Users

I plan to disable grouping only for the Opsgenie routes and only for a specific set of alerts. Here is an example of the current Alertmanager configuration:

route:
  group_wait: 5m
  group_interval: 5m
  repeat_interval: 7h

  receiver: admin
  routes:
  - match_re:
      alertname: ".* Type1 Server is down.* "
    receiver: admingroup2
    routes:
    - match:

        team: support
        severity: critical
      receiver: opsgeniesupport
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 6h
      continue: true
    - match:
        team: support
        severity: critical
      receiver: mailsupport
      group_wait: 1m
      group_interval: 1h
      repeat_interval: 12h

Q1: Is it possible to disable grouping for a specific type of alert (say, alerts with the Type1 keyword) only for the Opsgenie route? I am looking for something like:

    - match:

        team: support
        severity: critical
      receiver: opsgeniesupport
      group_by: [instance]
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 6h
      continue: true
    - match:
        team: support
        severity: critical
      receiver: mailsupport
      group_by: [instance]
      group_wait: 1m
      group_interval: 1h
      repeat_interval: 12h
Is this allowed by Alertmanager?


Q2: Is it possible to change the alert name in Prometheus before it dispatches the alert to Alertmanager?
- alert: "Type1 down or process monitoring service is unreachable"
      expr: up{ SERVER_CATEGORY='Type1'  } == 0
      for: 2m
      labels:

        severity: critical
        team: support
      annotations:
        summary: "{{ $labels.instance }} is not reachable"
        description: "{{ $labels.instance }} is not reachable"

    - alert: " Type1 down or process monitoring service is unreachable   - {{ $labels.instance}} " 

Hopefully this will help, as I am unable to get the appropriate tags in Opsgenie when using grouping.
Having a hostname tag will be helpful, since via the JIRA integration we can then see how many incidents have occurred for a host in the past.

Regards
MG

Brian Candler

Jul 27, 2024, 1:21:56 PM
to Prometheus Users
Q1 - yes, each route can have a separate group_by section, as shown in the documentation:

Note that if you do
group_by: [instance]
then you'll get one Opsgenie alert group for an instance, even if there are multiple problems with that instance. If you want to disable grouping completely, put a string with three dots between the square brackets:
group_by: ['...']
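Applied to your example it would be something like this (a sketch, using the receiver name from your config):

    - match:
        team: support
        severity: critical
      receiver: opsgeniesupport
      # per-route override: '...' disables grouping for this route only
      group_by: ['...']
      group_wait: 1m
      group_interval: 5m
      repeat_interval: 6h
      continue: true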

Q2 - I don't see why you want to put {{ $labels.instance }} in the alert name. It's then no longer the name of the alert, it's a combination of the name of the alert and the name of the instance; and to analyze the data by instance you'd have to parse it out of the alert.

Put it in the alert description instead.
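In other words, keep the alert name fixed and let the annotations carry the instance, as your existing rule already does:

    - alert: "Type1 down or process monitoring service is unreachable"
      expr: up{ SERVER_CATEGORY='Type1'  } == 0
      for: 2m
      labels:
        severity: critical
        team: support
      annotations:
        summary: "{{ $labels.instance }} is not reachable"
        description: "{{ $labels.instance }} is not reachable"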

> Having a hostname tag will be helpful, since via the JIRA integration we can then see how many incidents have occurred for a host in the past.

Surely it would be better to do this analysis with alert labels, and from what I can see of the POST content you showed, Opsgenie calls these "tags" rather than "labels".

> It seems Alertmanager would need to send another PUT request to update the Opsgenie tags.

Are you saying that the problem is that Alertmanager isn't updating the tags? But if these tags come from CommonLabels, and the alerts are part of a group, then the CommonLabels are by definition those which are common to all the alerts in the group.

It seems to me that there are two meaningful alternatives. Either:
1. multiple alerts from Prometheus are in the same group (in which case, it's a single alert as far as Opsgenie is concerned, and the tags are the labels common to all alerts in the group); or
2. you send separate alerts from Prometheus, each with their own tags, and then you analyze and/or group them Opsgenie-side.

If host-by-host incident analysis is what you want, then option (2) seems to be the way to go.

What version of Alertmanager are you running? Looking in the changelogs I don't see any particular recent changes, and I notice you're already using "update_alerts: true", but I thought it was worth checking.

## 0.25.0 / 2022-12-22

* [ENHANCEMENT] Support templating for Opsgenie's responder type. #3060

## 0.24.0 / 2022-03-24

* [ENHANCEMENT] Add `update_alerts` field to the OpsGenie configuration to update message and description when sending alerts. #2519
* [ENHANCEMENT] Add `entity` and `actions` fields to the OpsGenie configuration. #2753
* [ENHANCEMENT] Add `opsgenie_api_key_file` field to the global configuration. #2728
* [ENHANCEMENT] Add support for `teams` responders to the OpsGenie configuration. #2685

## 0.22.0 / 2021-05-21

* [ENHANCEMENT] OpsGenie: Propagate labels to Opsgenie details. #2276