Service Discovery, Relabeling for Alert Exemptions

Michael Kogelman

Jul 2, 2022, 9:14:49 PM
to Prometheus Users
Hey all,

First time long time here -- love Prom.

I'm a bit stumped and was hoping maybe someone could tell me where I'm not connecting.

I'm currently using a service discovery plugin to pull inventory from our source of truth, netbox.

Netbox returns a __meta_netbox_tags in the form of tag,tag,tag

I'm trying to relabel this field and save it into the timeseries so that an object can be exempted from certain alerts by matching on the absence of a tag with !=.

Here's what the output looks like for service discovery:

        "targets": [
            "server"
        ],
        "labels": {
            "__meta_netbox_status": "active",
            "__meta_netbox_model": "VirtualMachine",
            "__meta_netbox_name": "server",
            "__meta_netbox_primary_ip": "x.x.x.x",
            "__meta_netbox_primary_ip4": "x.x.x.x",
            "__meta_netbox_platform": "Linux (64-bit)",
            "__meta_netbox_platform_slug": "linux-64-bit",
            "__meta_netbox_tags": "NetBox-synced,prod,exempt-highcpu, exempt-highmem",
            "__meta_netbox_tag_slugs": "exempt-highmem, exempt-highcpu",
            "__meta_netbox_cluster": "production",
            "__meta_netbox_cluster_group": "XXX",
            "__meta_netbox_cluster_type": "VMware",
            "__meta_netbox_site": "XXX",
            "__meta_netbox_site_slug": "xxx",
            "__meta_netbox_role": "Server",
            "__meta_netbox_role_slug": "server"

Here's the latest iteration of what I'm trying to do in prometheus.yml (when i try to use the separators to tell it there's a comma the yaml stops parsing):

    metric_relabel_configs:
      - source_labels: [__meta_netbox_tags]
        regex: '.*'
        replacement: '$1'
        target_label: tags

And here's what I'm trying to do in our rules.yml:

  - alert: HostHighCpuLoad
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",tags!="highcpuexempt"}[5m])) * 100) > 85
    for: 10m
    labels:
      severity: warning
    annotations:
      identifier: '{{ $labels.instance }}'
      summary: "Host high CPU load (instance {{ $labels.instance }})"
      description: "CPU load is > 85% for 10 minutes\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Any assistance would be greatly appreciated!!

Thanks,
Mike




Ben Kochie

Jul 3, 2022, 4:01:10 AM
to Michael Kogelman, Prometheus Users
There are two problems.

Your regexp is missing a capture group.
You are using METRIC_relabel_configs.

METRIC_relabel_configs are applied after the scrape, to the scraped samples. By then the __meta_* discovery labels have already been dropped. You want to use "relabel_configs" to be able to read the discovery metadata.

Note the default value for "regex" is "(.*)" and the default for "replacement" is "$1". Simply omit these fields from your config and the defaults will be used.

relabel_configs:
- source_labels: [__meta_netbox_tags]
  target_label: tags
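
To see why the missing capture group matters, here is a rough Python model of a "replace" relabel rule. It is only a sketch of the documented behavior, not Prometheus code: the function names are made up for illustration, and `$N` expansion is simplified to plain numeric references.

```python
import re

def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement="$1", separator=";"):
    """Simplified model of a 'replace' relabel rule (hypothetical helper)."""
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # relabel regexes are fully anchored
    if match is None:
        return dict(labels)  # no match: labels are left untouched

    def expand(ref):
        index = int(ref.group(1))
        # A reference to a group the regex does not define expands to "".
        if index > match.re.groups:
            return ""
        return match.group(index) or ""

    result = re.sub(r"\$(\d+)", expand, replacement)
    out = dict(labels)
    if result:
        out[target_label] = result
    else:
        out.pop(target_label, None)  # empty result: the label is not set
    return out

def drop_internal(labels):
    """Labels starting with '__' are removed once target relabeling is done."""
    return {k: v for k, v in labels.items() if not k.startswith("__")}

discovered = {"__meta_netbox_tags": "NetBox-synced,prod,exempt-highcpu"}

# Defaults, as in the config above: the whole value is copied into "tags".
with_defaults = drop_internal(
    relabel_replace(discovered, ["__meta_netbox_tags"], "tags"))

# regex '.*' has no capture group, so '$1' expands to nothing.
no_group = drop_internal(
    relabel_replace(discovered, ["__meta_netbox_tags"], "tags", regex=".*"))
```

In this model `with_defaults` ends up with a `tags` label and `no_group` ends up with no labels at all, which matches the symptom in the original config.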


As for your alert, it won't work because your string matching is literal. Since your tags are comma lists, you will have to use a regexp.

expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle",tags!~".*highcpuexempt.*"}[5m])) * 100) > 85

Unfortunately, the netbox tag list doesn't contain leading and trailing commas in the list, so you can't use the match of tags!~".*,highcpuexempt,.*" to get more precise matching.
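
A quick way to check these matchers outside Prometheus is to model them in Python. This is a sketch under two assumptions: PromQL regex matchers are fully anchored (so `re.fullmatch` is a fair stand-in for these simple patterns), and the tag values below are made up for illustration.

```python
import re

def matches(pattern, value):
    # PromQL regex matchers are fully anchored, so model them with fullmatch.
    return re.fullmatch(pattern, value) is not None

# Hypothetical tag list with the exemption tag in last position.
tags = "NetBox-synced,prod,highcpuexempt"

# tags!="highcpuexempt" compares the whole label value, so it never
# excludes a series whose label holds a comma list.
literal_excludes = (tags == "highcpuexempt")

# tags!~".*highcpuexempt.*" excludes the series, as intended...
substring_excludes = matches(".*highcpuexempt.*", tags)

# ...but it also hits any tag that merely contains the text.
loose_false_hit = matches(".*highcpuexempt.*", "prod,nothighcpuexempt2")

# The comma-bounded form misses a tag at the start or end of the list.
bounded_excludes = matches(".*,highcpuexempt,.*", tags)
```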

Lastly, I would not use this alert anyway. It's a type of "Cause alert", rather than a "Symptom alert". As you have noticed, you have to make exceptions. This style of "My CPU is too high" is toil prone. They generate false positives leading to alert fatigue.



--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/591bccb8-b04a-4150-a0ad-fe3edd7dd0f7n%40googlegroups.com.

Brian Candler

Jul 3, 2022, 5:16:29 AM
to Prometheus Users
Out of interest, what's the netbox service discovery plugin you're using - can you give a link please?

> when i try to use the separators to tell it there's a comma the yaml stops parsing

Then it must be a syntax error in your yaml. There are cases when strings *must* be quoted, and since comma has significance within YAML itself, this is probably one of those cases. You'd need:

    separator: ','

But it doesn't matter because this setting won't do anything anyway here. It's used when concatenating the values of multiple "source_labels:", but you have listed only one.
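
As a small illustration of what the separator does, here is a Python sketch of the join step. The helper name is hypothetical; the default separator of ";" is the documented one.

```python
def join_source_labels(labels, source_labels, separator=";"):
    """Model of how source label values are concatenated before the regex runs."""
    return separator.join(labels.get(name, "") for name in source_labels)

labels = {
    "__meta_netbox_tags": "prod,exempt-highcpu",
    "__meta_netbox_site_slug": "xxx",
    "__meta_netbox_role_slug": "server",
}

# One source label: nothing to join, so the separator never appears.
one = join_source_labels(labels, ["__meta_netbox_tags"], separator=",")

# Two source labels: the separator is placed between their values.
two = join_source_labels(
    labels, ["__meta_netbox_site_slug", "__meta_netbox_role_slug"], separator=",")
```

Here `one` is just the original tag string, while `two` is the two slugs joined by a comma.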

> Unfortunately, the netbox tag list doesn't contain leading and trailing commas in the list, so you can't use the match of tags!~".*,highcpuexempt,.*" to get more precise matching.

At least not without a more fancy regexp. I think either of these should work:

tags!~"(.*,|)highcpuexempt(,.*|)"

tags!~"(.*,)?highcpuexempt(,.*)?"
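
Patterns of this shape are easy to sanity-check in Python before putting them in a rule. The sketch below assumes full anchoring (as in PromQL) and uses a trailing group of the form `(,.*)?` so that any tags following the match are accepted. The helper name and sample values are made up.

```python
import re

# Whole-token match inside a comma-separated list, fully anchored.
TOKEN = "(.*,)?highcpuexempt(,.*)?"

def is_exempt(tags):
    return re.fullmatch(TOKEN, tags) is not None

cases = {
    "highcpuexempt": is_exempt("highcpuexempt"),                 # only tag
    "highcpuexempt,prod": is_exempt("highcpuexempt,prod"),       # first tag
    "prod,highcpuexempt": is_exempt("prod,highcpuexempt"),       # last tag
    "a,highcpuexempt,b": is_exempt("a,highcpuexempt,b"),         # middle tag
    "prod,nothighcpuexempt": is_exempt("prod,nothighcpuexempt"), # longer tag
    "highcpuexempt2,prod": is_exempt("highcpuexempt2,prod"),     # longer tag
}
```

The first four cases match and the last two do not. One caveat: the sample discovery output earlier in the thread has a space after one of the commas, and a strict pattern like this would not match a tag that is preceded by a space.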

> This style of "My CPU is too high" is toil prone. They generate false positives leading to alert fatigue.

I strongly agree. Another useful reference:

Michael Kogelman

Jul 3, 2022, 1:11:16 PM
to Brian Candler, Prometheus Users
I did resolve this by going back to relabel_configs with some other examples I found, so the exemption is working. I do hear you about the high cpu style alerting but our team seems to want it — will dig in some more.

The netbox http sd we’re using is:


Thanks for your help!
