How to filter monitored interfaces with snmp_exporter and Prometheus


Sebastiaan van Doesselaar

Oct 11, 2023, 3:05:41 AM
to Prometheus Users
Hi all,

I've got a PoC running with Nautobot (our source of truth, scraped via HTTP service discovery), Prometheus, and snmp_exporter. It's scaling very well, much better than we anticipated, and we're eager to put the finishing touches on the PoC.

In Nautobot I want to mark each interface as either monitored or not. Based on that, I generate a label, for example: "__meta_nautobot_monitored_interfaces": "Fa0,Fa1".

A full response from Nautobot might look something like this:

[ { "targets": [ "ORDER12345678" ], "labels": { "__meta_nautobot_status": "Active", "__meta_nautobot_model": "Device", "__meta_nautobot_name": "ORDER12345678", "__meta_nautobot_id": "c301aebf-e92d-4f72-8f2b-5768144c42f4", "__meta_nautobot_primary_ip": "xxx", "__meta_nautobot_primary_ip4": "xxx", "__meta_nautobot_monitored_interfaces": "Fa0,Fa1", "__meta_nautobot_role": "CPE", "__meta_nautobot_role_slug": "cpe", "__meta_nautobot_device_type": "ASR9006", "__meta_nautobot_device_type_slug": "asr9006", "__meta_nautobot_site": "Place", "__meta_nautobot_site_slug": "place" } } ]


My question is two-fold:
  - I'd like to drop all unrelated interfaces. I don't know how to do that with relabeling given this comma-separated string. I'm open to presenting the data in a different way, since I wrote the Prometheus service discovery plugin myself (based on the NetBox one), but I haven't thought of a better representation.

  - I only want to alert on the monitored interfaces. If the first point is solved this becomes a non-issue, but if it's not possible, I'd at least like to restrict alerting to the monitored interfaces.

This is going to run on ~20,000 devices with differing configurations, models, vendors, etc. It needs to be very dynamic, sourced entirely from Nautobot, and refreshed when changes are made there. Using snmp.yml and the generator for this doesn't seem feasible unless I created a new module for every possible configuration, which doesn't seem ideal.

If anyone has any suggestions on how to accomplish this, I'd very much appreciate it. 

Ben Kochie

Oct 11, 2023, 4:15:26 AM
to Sebastiaan van Doesselaar, Prometheus Users
For alerting on monitored interfaces, I'd suggest a different approach than trying to apply the filter at discovery time. The discovery phase can easily apply labels to the whole target device, but it's not well suited to annotating individual metrics.

What I would suggest is that you generate a set of recording rules that define which interfaces should be alerted on. Then you can use a join at alert query time. This is also how you can set different alerting thresholds dynamically.

For example, if you have rules like these:

groups:
- name: monitored interfaces
  interval: 1m
  rules:
    - record: monitored_interface_info
      expr: vector(1)
      labels:
        instance: ORDER12345678
        ifDescr: Fa0
    - record: monitored_interface_info
      expr: vector(1)
      labels:
        instance: ORDER12345678
        ifDescr: Fa1

Then your alert would look like this:

- name: alerts
  rules:
    - alert: InterfaceDown
      expr: (ifOperStatus == 0) * on (instance, ifDescr) monitored_interface_info
      for: 5m

You can use the Nautobot database to generate the rules file.
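
For example, a minimal sketch of such a generator in Python (the hard-coded devices list here is a hypothetical stand-in for whatever you fetch from the Nautobot API):

import yaml  # PyYAML

# Hypothetical inventory; in practice, fetch this from the Nautobot API.
devices = [
    {"name": "ORDER12345678", "monitored_interfaces": ["Fa0", "Fa1"]},
]

# Build one monitored_interface_info rule per (device, interface) pair.
rules = []
for device in devices:
    for ifdescr in device["monitored_interfaces"]:
        rules.append({
            "record": "monitored_interface_info",
            "expr": "vector(1)",
            "labels": {"instance": device["name"], "ifDescr": ifdescr},
        })

with open("monitored_interfaces.rules.yml", "w") as f:
    yaml.safe_dump(
        {"groups": [{"name": "monitored interfaces", "interval": "1m", "rules": rules}]},
        f,
        sort_keys=False,  # keep record/expr/labels in a readable order
    )

After rewriting the file, HUP Prometheus (or POST to /-/reload if --web.enable-lifecycle is on) to pick up the changes.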

Another approach would be to populate the monitored-interface information in your devices themselves. If you can tag the interface descriptions/aliases with a structured format, you can use metric_relabel_configs to create a monitored_interface label.

So if your interface description is, say, Fa0;true, you can do something like this:

metric_relabel_configs:
# Copy the flag after the ';' into a monitored_interface label.
- source_labels: [ifDescr]
  regex: '.+;(.+)'
  target_label: monitored_interface
# Then strip the flag from ifDescr itself (order matters: this runs second).
- source_labels: [ifDescr]
  regex: '(.+);.+'
  target_label: ifDescr
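
With the defaults (action: replace, replacement: $1), that would turn a series like

ifOperStatus{ifDescr="Fa0;true"}

into

ifOperStatus{ifDescr="Fa0",monitored_interface="true"}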

Sebastiaan van Doesselaar

Oct 11, 2023, 9:22:04 AM
to Prometheus Users
Thank you very much for the pointers. During my Google adventures I'd considered that a recording rule might work, and as you mention, it works very well indeed.

I had to slightly modify your query to:  (ifOperStatus != 1) * on (instance,ifName) monitored_interface_info

That gives me exactly the result I want. It probably still needs some fine-tuning, but that's fine.

To get back to your second suggestion: that's unfortunately not an option for us, since we're not always in full control of the devices we monitor. If we were, that would indeed be the easier and better solution.

Two questions left:
  • Is there any recommended/supported way of loading the rules dynamically? I saw you need a SIGHUP to reload them, so I could script that easily, but I'd prefer something natively supported, like the http_sd_config setup we use for service discovery.
  • What will the performance impact be of, say, 60,000 recording rules like this (20,000 instances with 2-5 monitored interfaces each)? I imagine it'll either be peanuts for Prometheus or much heavier than I expect.

Brian Candler

Oct 12, 2023, 5:00:20 AM
to Prometheus Users
If I understand what you're doing correctly, I wouldn't have 60,000 static recording rules; I would just generate a text file in the Prometheus exposition format, like this:

monitored_interface_info{instance="ORDER12345678",ifDescr="Fa0"} 1
monitored_interface_info{instance="ORDER12345678",ifDescr="Fa1"} 1
... etc

Then I would either stick this on a webserver and scrape it, or drop it into a file for node_exporter's textfile collector to pick up. Either way would need "honor_labels: true" to preserve the 'instance' label (or you can put it in a different label, and then use metric relabelling to move it).
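
For the webserver variant, a minimal scrape job might look like this (the job name and target host are placeholders):

scrape_configs:
  - job_name: monitored_interfaces
    honor_labels: true  # keep the instance label from the scraped file
    static_configs:
      - targets: ['inventory.example.com:8080']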

This also solves your problem about changes: there's no need to HUP Prometheus, since the data updates on the next scrape.

Sebastiaan van Doesselaar

Oct 12, 2023, 8:56:29 AM
to Prometheus Users
That makes a lot of sense! Nautobot supports exposing Prometheus metrics, and I've modified my service discovery plugin to do just that.
I'm going to fine-tune this a bit more, and then it'll work. This feels like the logical conclusion in hindsight, but I really needed the hints you both gave. Thanks!
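
Roughly, the idea looks like this (a minimal standalone sketch with prometheus_client, not the actual plugin code; the hard-coded inventory is a hypothetical stand-in for the Nautobot API):

import time
from prometheus_client import Gauge, start_http_server

# Hypothetical inventory; in practice, fetch this from the Nautobot API.
MONITORED = {
    "ORDER12345678": ["Fa0", "Fa1"],
}

# One series per monitored interface, value 1, joined on instance/ifName.
info = Gauge(
    "monitored_interface_info",
    "1 for every interface that should be alerted on",
    ["instance", "ifName"],
)

def refresh():
    info.clear()  # drop series for interfaces that are no longer monitored
    for device, interfaces in MONITORED.items():
        for ifname in interfaces:
            info.labels(instance=device, ifName=ifname).set(1)

if __name__ == "__main__":
    start_http_server(8000)  # scrape this job with honor_labels: true
    while True:
        refresh()
        time.sleep(60)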

This will end up exposing roughly 20k × 3-5 series (one for every monitored interface across 20k devices), so let's say 75k extra series in the long run. I assume Prometheus won't really feel any pain from that? My understanding is that 75k isn't much for Prometheus on the whole.

Ben Kochie

Oct 12, 2023, 10:18:41 AM
to Sebastiaan van Doesselaar, Prometheus Users
Yes, 75k is no problem. Typically I tell application developers to stay below 10k metrics per instance, but for inventory data like this it's fine to have more. I like to keep individual scrapes below a few hundred thousand series.

One pro tip here: make sure your exporter endpoints support gzip HTTP encoding to reduce the response size over the wire.

A decent-sized Prometheus can handle a few tens of millions of series total.

Sebastiaan van Doesselaar

Oct 16, 2023, 1:59:14 AM
to Prometheus Users
I'll check out gzip encoding. I think the endpoint already supports it, but I'll verify.

Thanks again to you both!
