Alertmanager slack alerting issues


rs

Aug 12, 2022, 4:36:22 AM
to Prometheus Users
Hi everyone! I am configuring Alertmanager to send alerts to a prod Slack channel and a dev Slack channel. I have checked with the routing tree editor and everything should be working correctly.
However, I am seeing some (not all) alerts that are tagged with 'env: dev' being sent to the prod Slack channel. Is there some sort of old configuration caching happening? Is there a way to flush it out?

--- Alertmanager.yml ---
global:
  http_config:
    proxy_url: 'xyz'
templates:
  - templates/*.tmpl
route:
  group_by: [cluster,alertname]
  group_wait: 10s
  group_interval: 30m
  repeat_interval: 24h
  receiver: 'slack'
  routes:
  - receiver: 'production'
    match:
      env: 'prod'
    continue: true
  - receiver: 'staging'
    match:
      env: 'dev'
    continue: true
receivers:
# Fallback option - defaults to the production channel
- name: 'slack'
  slack_configs:
  - api_url: 'api url'
    channel: '#prod-channel'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    text: '{{ template "slack.text" . }}'
    actions:
      - type: button
        text: 'Query :mag:'
        url: '{{ (index .Alerts 0).GeneratorURL }}'
      - type: button
        text: 'Silence :no_bell:'
        url: '{{ template "__alert_silence_link" . }}'
      - type: button
        text: 'Dashboard :grafana:'
        url: '{{ (index .Alerts 0).Annotations.dashboard }}'
- name: 'staging'
  slack_configs:
  - api_url: 'api url'
    channel: '#staging-channel'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    text: '{{ template "slack.text" . }}'
    actions:
      - type: button
        text: 'Query :mag:'
        url: '{{ (index .Alerts 0).GeneratorURL }}'
      - type: button
        text: 'Silence :no_bell:'
        url: '{{ template "__alert_silence_link" . }}'
      - type: button
        text: 'Dashboard :grafana:'
        url: '{{ (index .Alerts 0).Annotations.dashboard }}'
- name: 'production'
  slack_configs:
  - api_url: 'api url'
    channel: '#prod-channel'
    send_resolved: true
    color: '{{ template "slack.color" . }}'
    title: '{{ template "slack.title" . }}'
    text: '{{ template "slack.text" . }}'
    actions:
      - type: button
        text: 'Query :mag:'
        url: '{{ (index .Alerts 0).GeneratorURL }}'
      - type: button
        text: 'Silence :no_bell:'
        url: '{{ template "__alert_silence_link" . }}'
      - type: button
        text: 'Dashboard :grafana:'
        url: '{{ (index .Alerts 0).Annotations.dashboard }}'

Brian Candler

Aug 12, 2022, 11:29:34 AM
to Prometheus Users
Firstly, I'd drop the "continue: true" lines. They are not required, and are just going to cause confusion.

The 'slack' and 'production' receivers both send to #prod-channel, so you'll end up in the prod channel whenever the env label is not exactly "dev". I suggest you look in detail at the alerts themselves: maybe they're tagged with "Dev", or "dev " with a hidden trailing space.

If you change the default 'slack' receiver to go to a different channel, or use a different title/text template, it will be easier to see whether this is the problem.
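
For example, a cut-down route tree along these lines, with the fallback pointed at its own channel so anything misrouted stands out (the '#alertmanager-fallback' channel name is just a placeholder):

    route:
      group_by: [cluster, alertname]
      receiver: 'slack'                       # fallback only - should normally stay quiet
      routes:
      - receiver: 'production'
        match:
          env: 'prod'
      - receiver: 'staging'
        match:
          env: 'dev'
    receivers:
    - name: 'slack'
      slack_configs:
      - api_url: 'api url'
        channel: '#alertmanager-fallback'     # placeholder - anything distinct from prod/staging
        send_resolved: true

Anything that still lands in the fallback channel is an alert whose env label didn't match 'prod' or 'dev' exactly. (If case or stray whitespace turns out to be the culprit, a match_re matcher would confirm it, though fixing the label at the source is the better cure.)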

rs

Aug 22, 2022, 1:45:25 PM
to Prometheus Users
I checked the JSON file and the tagging was correct. Here's an example:


    {
        "labels": {
            "cluster": "X Stage Servers",
            "env": "dev"
        },
        "targets": [
            "x:9100",
            "y:9100",
            "z:9100"
        ]
    },
This is being sent to the production/default channel.

Brian Candler

Aug 22, 2022, 3:06:47 PM
to Prometheus Users
"Looks correct but still doesn't work how I expect"

What you've shown is a target configuration, not an alert arriving at Alertmanager.

Therefore, I'm suggesting you take a divide-and-conquer approach. First, work out which routing rule is actually being matched (is it the 'production' receiver, or the 'slack' fallback?) by making their output visibly different. Once you know which rule fires, you can work out why.

There are all sorts of reasons it might not work, other than the config you've shown. For example: target or metric relabelling rules which set env; the exporter itself setting "env" while you have honor_labels enabled; and so on.

Hence the first step is to find out from real alert events: is the alert being generated without the "dev" label? In that case the routing is fine, and you need to work out why the label is wrong (you're looking at the Prometheus side). Or is the alert actually arriving at Alertmanager with the "dev" label, in which case you're looking at the Alertmanager side to find out why it isn't being routed as expected.
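
One quick way to check what the real alert events look like is to query the built-in ALERTS series in the Prometheus expression browser ("YourAlertName" below is just a placeholder for whichever alert is misbehaving):

    ALERTS{alertname="YourAlertName", alertstate="firing"}

The label set shown there is what the alert carries when it is sent to Alertmanager, so if "env" is missing or wrong at that point, the problem is on the Prometheus side rather than in the Alertmanager routing.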

rs

Aug 22, 2022, 4:21:51 PM
to Prometheus Users
Thanks Brian, I am in the midst of setting up a separate Slack receiver (to weed out the alerts going to the wrong channel). One thing I have noticed is that the incorrectly routed alerts may actually come down to this rule:

- alert: High_Cpu_Load
  expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95
  for: 0m
  labels:
    severity: warning
  annotations:
    summary: Host high CPU load (instance {{ $labels.instance }})
    description: "CPU load is > 95%\n INSTANCE = {{ $labels.instance }}\n VALUE = %{{ $value | humanize }}\n LABELS = {{ $labels }}"

I believe the issue may be that I'm not carrying 'env' through the expression, and that is what's throwing off the routing. Just a hunch, but I appreciate you pointing me in the right direction!

Brian Candler

Aug 23, 2022, 2:45:59 AM
to Prometheus Users
Yes, you've got it.  It's easy to test your hypothesis: simply paste the alert rule expression

    100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95

into the PromQL query browser in the Prometheus web interface, and you'll see all the results - including their labels.

I believe you'll get results like

{instance="foo",cluster="bar"} 98.4

There won't be any "env" label there because you've aggregated it away.

Try using: avg by(instance,cluster,env) instead.
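
Putting that into your rule, the expression becomes something like:

    expr: 100 - (avg by(instance,cluster,env) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95

Because env is now part of the by() clause, it is kept on the result and ends up on the alert, so your 'env: dev' route can match it.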

Or you could have separate alerting rules per environment, and re-apply the label in your rule:

    expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{env="dev",mode="idle"}[2m])) * 100) > 98
    labels:
      env: dev
