Summarize large SNMP table


Cameron Kerr

Dec 15, 2019, 5:03:18 PM
to Prometheus Users
I am looking at data from Cisco Wireless Controllers on a very large wireless site; let's say on the order of 5,000 APs, with obviously many many more clients.

The main value I want from this is to find APs / buildings that have poor service quality (how best to define that, I'm not yet sure). To assist me in this, I've already got snmp_exporter configured to poll the WLCs (Wireless LAN Controllers) for information about APs, using some metric rewriting to extract the building code from the AP name.

To make this more concrete, there is an SNMP table, CISCO-LWAPP-DOT11-CLIENT-MIB::cldcClientTable, which contains one row for each client. I certainly don't want that kind of cardinality problem, so I want to get some summary data instead. One of the columns is cldcClientReasonCode, which gives the reason for a client being disassociated, and is one of the columns I would be most interested in exposing.

What I think I would like is a metric like the following:

cldcClientTable_summary_ClientReasonCode{ap=AP_NAME,reason_code=previousAuthNotValid} = 12
cldcClientTable_summary_ClientReasonCode{ap=AP_NAME,reason_code=disassociationStaHasLeft} = 23
cldcClientTable_summary_count{ap=AP_NAME,reason_code=previousAuthNotValid} = 113

(I've not done any summary type of things yet in Prometheus, so my strategy may be off).

Is this something that is achievable using snmp_exporter (I don't think it is), or should I expect to create my own exporter for this? If people think it may be of value, I could have a stab at extending snmp_exporter (and the generator), although I haven't done much Golang.

I'm wondering what the generator.yml would look like. Currently, my generator.yml (actually, this is the Jinja template that Ansible will populate) contains this:

  cisco_wireless_controllers:
    {{ snmp_credentials.cisco_wireless_controllers | to_nice_yaml(indent=2) | indent(4) }}
    walk:
      - 1.3.6.1.4.1.9.9.513.1.1.1 # CISCO-LWAPP-AP-MIB::cLApTable
      - 1.3.6.1.4.1.14179.2.1.1 # AIRESPACE-WIRELESS-MIB::bsnDot11EssTable
    lookups:
      - source_indexes: ['cLApSysMacAddress']
        lookup: cLApName
        drop_source_indexes: true
      - source_indexes: ['bsnDot11EssIndex']
        lookup: bsnDot11EssSsid
        drop_source_indexes: true

    overrides:

      # CISCO-LWAPP-AP-MIB::cLApTable, which is very wide and often useful. We omit many columns
      # It would be very helpful if we could specify a whitelist of table columns
      #
      cLApSysMacAddress: # index for each row
        type: PhysAddress48
        ignore: false
      cLApIfMacAddress:
        type: PhysAddress48
        ignore: false
      cLApMaxNumberOfDot11Slots:
        ignore: true
      cLApEntPhysicalIndex:
        ignore: true
      cLApName:
        ignore: false
      cLApUpTime:
        ignore: false
      cLLwappUpTime:
        ignore: false
... and many other ignores; which I generated with some scripting based around snmptable to get just the headers, and ignored everything by default

You'd only want to summarize particular tables though, so it might be more useful to have this specified under the relevant 'walk' key.

Maybe something like this would be useful:

    walk:
      - 1.3.6.1.4.1.9.9.513.1.1.1 # CISCO-LWAPP-AP-MIB::cLApTable
      - 1.3.6.1.4.1.14179.2.1.1 # AIRESPACE-WIRELESS-MIB::bsnDot11EssTable
      - oid: 1.3.6.1.4.1.9.9.599.1.3.1 # CISCO-LWAPP-DOT11-CLIENT-MIB::cldcClientTable
        # EITHER produce a summary for each column of the table independently
        summarize_individually:
          - cldcClientStatus
          - cldcClientReasonCode
          - cldcClientProtocol
          - cldcAssociationMode
        # OR perhaps the count for each combination of summarized values, but
        # potential cardinality risk there though, perhaps.
        summarize_combinations:
          - cldcClientStatus
          - cldcClientReasonCode
    lookups:

Thanks for all the work on this product, it's been very, very useful.

Cheers,
Cameron

Cameron Kerr

Dec 15, 2019, 5:06:59 PM
to Prometheus Users
Edit:

cldcClientTable_summary_ClientReasonCode{ap=AP_NAME,reason_code=previousAuthNotValid} = 12
cldcClientTable_summary_ClientReasonCode{ap=AP_NAME,reason_code=disassociationStaHasLeft} = 23
cldcClientTable_summary_count{ap=AP_NAME} = 113

Although come to think of it, I'm not sure that we would even get the AP in this table.

Brian Brazil

Dec 15, 2019, 5:19:59 PM
to Cameron Kerr, Prometheus Users
The SNMP exporter doesn't do any math, but if you've already done the expensive bit of fetching all that per-client data from the WLC then there's not too much point optimising the later bits. PromQL's count_values aggregator can cover the rest.
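
As a rough illustration of what count_values computes (a sketch added for clarity, not something posted in the thread): it groups series by their current sample value and reports how many series share each value. Assuming the per-client column were exposed under the metric name cldcClientReasonCode (an assumption about how the generated metric would be named), a query along the lines of count_values("reason_code", cldcClientReasonCode) would return one series per distinct reason code. The Python below mimics that grouping on made-up data; the MAC addresses and numeric codes are purely illustrative.

```python
from collections import Counter


def count_values(samples):
    """Mimic PromQL's count_values: map each distinct sample value to the
    number of series currently reporting that value."""
    return dict(Counter(samples.values()))


# Made-up data: client MAC -> numeric reason code (illustrative values only).
samples = {
    "aa:aa:aa:00:00:01": 2,
    "aa:aa:aa:00:00:02": 8,
    "aa:aa:aa:00:00:03": 2,
}

counts = count_values(samples)
print(counts)
```

Note that this only works after every per-client series has been ingested, which is the cost discussed in the replies that follow.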

Brian
 
--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f1379e1b-3738-4024-9829-1d43ed305c54%40googlegroups.com.



Cameron Kerr

Dec 15, 2019, 6:21:46 PM
to Prometheus Users


I haven't fetched anything from that table yet; the cardinality of fetching just the APs gave me enough of a concern that I didn't want to create a cardinality explosion from a table keyed off a client MAC address. From an SNMP standpoint, I would still need to fetch all the rows (which could still be an issue perhaps, but not one affecting Prometheus).

Thanks for your prompt reply though, it at least helps give me some closure on what I thought was the likely case.

Cheers,
Cameron

 

Ben Kochie

Dec 15, 2019, 6:34:17 PM
to Cameron Kerr, Prometheus Users
That is going to be quite a lot of cardinality. And yes, you'll have to ingest it all to generate summary data like you're looking for in PromQL.

I wonder if there is any other API that Cisco provides other than SNMP to gather this data. Then you can build a more custom exporter to produce this data.

What I might do is to try doing some scrapes manually with curl, just to get timing and example data, to see how much load it generates.

One thing I noticed is that the examples you gave don't match the generator/MIB. What you're looking for is possible, but it's easier to understand and help with what you're trying to do if you stick to generator/MIB content.
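
The manual timing check suggested above can also be scripted. This is a minimal sketch in Python rather than curl, under stated assumptions: time_scrape and fake_fetch are hypothetical names, and the sample text is made up. For a real measurement, fake_fetch would be replaced by an HTTP GET against the snmp_exporter /snmp endpoint with its target and module parameters (host, port, and module name depend on the local setup).

```python
import time


def time_scrape(fetch):
    """Call fetch() once and return (seconds taken, number of sample lines)."""
    start = time.perf_counter()
    body = fetch()
    elapsed = time.perf_counter() - start
    # Sample lines are the non-empty lines that are not '#' comment lines
    # (HELP / TYPE metadata).
    n_samples = sum(
        1 for line in body.splitlines()
        if line.strip() and not line.startswith("#")
    )
    return elapsed, n_samples


def fake_fetch():
    # Stand-in for a real HTTP GET against snmp_exporter; content is made up.
    return (
        "# HELP cLApUpTime placeholder help text\n"
        "# TYPE cLApUpTime gauge\n"
        'cLApUpTime{cLApName="AP-1"} 100\n'
        'cLApUpTime{cLApName="AP-2"} 200\n'
    )


elapsed, n_samples = time_scrape(fake_fetch)
print(f"{n_samples} samples in {elapsed:.3f}s")
```

The sample count gives a first estimate of the series cardinality a single scrape would add, and the elapsed time hints at the load placed on the WLC.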



Brian Candler

Dec 16, 2019, 3:36:36 AM
to Prometheus Users
> What I might do is to try doing some scrapes manually with curl, just to get timing and example data, to see how much load it generates.

I completely agree it would be helpful to see some examples of what the current MIB returns, before proposing any solutions for what the aggregated metrics might look like.

If you just want a count of the number of disconnection events of type X, maybe the MIB has it already somewhere.  If not: simply adding the per-client counters together probably won't work.  When a client vanishes from the table, its counter will vanish too, and the total will go down.  If the table is structured like that then you'd need to sum the increases of the counters.
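
To make the counting problem concrete, here is a small sketch with made-up numbers (the client names and counter values are hypothetical). It compares the change in the naive total against the sum of per-client increases between two scrapes, treating a counter that goes backwards as a reset. In PromQL the second approach would look roughly like sum(increase(some_counter[5m])), assuming the table really does expose per-client counters.

```python
def sum_of_increases(prev, curr):
    """Sum per-client counter increases between two scrapes.

    Clients missing from either scrape contribute nothing, and a counter
    that went backwards is treated as a reset (its new value is the increase).
    """
    total = 0
    for client, value in curr.items():
        if client not in prev:
            continue  # no earlier sample, so no increase can be computed
        delta = value - prev[client]
        total += delta if delta >= 0 else value
    return total


prev = {"client-a": 5, "client-b": 7, "client-c": 4}
curr = {"client-a": 6, "client-b": 9}  # client-c has left the table

naive_change = sum(curr.values()) - sum(prev.values())
increase = sum_of_increases(prev, curr)
print(naive_change, increase)
```

With these numbers the naive total drops even though three new events occurred, while the sum of increases reports them.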

Regarding cardinality: roughly how many clients total?  Even if we're talking 500,000 clients, it still might be reasonable to set up a separate prometheus instance(*) dedicated to doing the scraping, which you could use for your data mining.  Then if you aggregate using recording rules, your main prometheus could pull in the summary information from that.  Being able to analyze how individual clients move between APs could also give some useful insights.

Finally, if you really don't want to get prometheus to do this: rather than extend snmp_exporter, I'd be inclined to write a proxy which scrapes snmp_exporter, does the custom aggregation/filtering, and exposes the results.  It would be very simple to cobble together.

(*) or spread the APs across multiple prometheus instances, per site or building perhaps.
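
A minimal sketch of the aggregation step such a proxy might perform, under stated assumptions: the metric name cldcClientReasonCode, the label cldcClientMacAddress, the output name with its _clients suffix, and the sample values are all illustrative, and the parser only handles simple exposition lines (no timestamps, no '}' inside label values). Fetching from snmp_exporter and serving the result over HTTP are left out.

```python
import re
from collections import Counter

# Matches simple exposition lines such as: name{label="x"} 3
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)


def aggregate(text, metric):
    """Count how many series of `metric` report each value."""
    counts = Counter()
    for line in text.splitlines():
        m = SAMPLE_RE.match(line.strip())
        if m and m.group("name") == metric:
            counts[m.group("value")] += 1
    return counts


def render(counts, metric):
    """Render the counts as exposition text under a hypothetical summary name."""
    lines = [f"# TYPE {metric}_clients gauge"]
    for value, n in sorted(counts.items()):
        lines.append(f'{metric}_clients{{reason_code="{value}"}} {n}')
    return "\n".join(lines) + "\n"


# Made-up scrape output standing in for what snmp_exporter might return.
scraped = (
    "# TYPE cldcClientReasonCode gauge\n"
    'cldcClientReasonCode{cldcClientMacAddress="AA:AA:AA:00:00:01"} 2\n'
    'cldcClientReasonCode{cldcClientMacAddress="AA:AA:AA:00:00:02"} 8\n'
    'cldcClientReasonCode{cldcClientMacAddress="AA:AA:AA:00:00:03"} 2\n'
)

summary = render(aggregate(scraped, "cldcClientReasonCode"), "cldcClientReasonCode")
print(summary)
```

Prometheus would then scrape the proxy and store only one series per reason code instead of one per client, at the cost of losing the per-client detail.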

Ben Kochie

Dec 16, 2019, 5:50:46 AM
to Brian Candler, Prometheus Users
On Mon, Dec 16, 2019 at 9:36 AM Brian Candler <b.ca...@pobox.com> wrote:
> Regarding cardinality: roughly how many clients total?  Even if we're talking 500,000 clients, it still might be reasonable to set up a separate prometheus instance(*) dedicated to doing the scraping, which you could use for your data mining.  Then if you aggregate using recording rules, your main prometheus could pull in the summary information from that.  Being able to analyze how individual clients move between APs could also give some useful insights.

500k on a single index is pretty big. My usual recommendation is to keep a single label value set in the 10-20k range. Maybe with a dedicated instance just for this metric, but that seems like an awful lot of effort for this data, especially given the counting problems you mention. This really should be something counted as a real AP metric.

> Finally, if you really don't want to get prometheus to do this: rather than extend snmp_exporter, I'd be inclined to write a proxy which scrapes snmp_exporter, does the custom aggregation/filtering, and exposes the results.  It would be very simple to cobble together.

Yeah, it's a bit out of scope for the exporter. It's designed to be just a raw pass-through, with the minimum amount of data manipulation needed to follow the SNMP spec and work around a few vendor bugs.

I don't have regular access to WLCs, so I won't be much help there.


Daniel Swarbrick

Dec 17, 2019, 5:33:34 PM
to Prometheus Users