Hi all, I have been running into some strange snmp walk timeout issues with snmp exporter against citrix netscaler appliances.
Running latest (0.18.0) snmp exporter as a docker container.
If I try to walk the "vServer" or other similar metrics which have a time series for each vserver (as opposed to e.g. netscaler appliance cpu metrics), the walks fail due to timeouts in a bizarrely periodic way. We currently have around 420 vservers on each load balancer.
Behaviour
The snmp exporter will fail to walk the netscaler at approx 15 mins past the hour every hour, and will not walk again correctly for 15-20 mins. I am walking 2 netscalers, and the scrapes fail on both netscalers at the same time. One resumes walking after about 15 mins, while the other takes about 25 min to resume walking. Image shows "snmp_scrape_duration_seconds" for the netscaler module from the Prometheus interface.

The problem is not with Prometheus, as you can observe the timeouts when targeting the netscaler from the SNMP exporter web interface, which reports the following error:
An error has occurred while serving metrics:
error collecting metric Desc{fqName: "snmp_error", help: "Error scraping target", constLabels: {}, variableLabels: []}: error walking target example.com: Request timeout (after 3 retries)
The logs for the snmp exporter container show this error:
level=info ts=2020-06-07T23:28:20.946Z caller=collector.go:224 module=citrix_adc target=example.com msg="Error scraping target" err="scrape canceled (possible timeout) walking target example.com"
A few days ago I was using snmp exporter version 0.17.0 and the error was more along the lines of `context canceled`. I realise there were some updates to timeouts made in the latest update but that doesn't seem to be helping in this situation (see more info about my timeout settings further below).
No noticeable problems are happening from the netscaler's perspective. These are production appliances and everything is running fine.
I am not sure if this is an snmp exporter related problem or a netscaler related problem.
I have done testing from the command line to confirm the netscaler is still responding to snmp. The command takes longer than during the 'non-timeout' period, but it does not time out or fail. The fact that I can run `snmpbulkwalk` on the entire `vserver` table from my command line and get no timeout error during the same period makes me think it's snmp exporter related, whereas the fact that it happens on a regular periodic cycle makes me think it could be something that's happening on the netscaler.
If I generate a new minimal snmp.conf during the 'timeout period' with the vserver related OIDs removed, leaving just e.g. the netscaler cpu stats, the walks resume straight away.
When I time `snmpbulkwalk` on the vserver table from the command line (using the linux "time" command), it normally takes about 3 seconds to run. During the weird hourly 'timeout' period it takes about 6 seconds.
Changing my `timeout` or `max_repetitions` does not seem to have any effect: I have tried setting the timeout to more than 30s, and both increasing and decreasing `max_repetitions`, and it still fails. The snmp exporter fails to walk one column of a table, while I can walk the entire table with no failure from the command line.
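For anyone not familiar with these settings, here is a rough sketch of where they sit in an exporter module, assuming the 0.18.0 layout where the walk parameters are module-level keys. The values, the community string and the OID are illustrative placeholders rather than my exact config; the OID is assumed to be the vserver table from NS-ROOT-MIB.

```yaml
citrix_adc:
  walk:
  - 1.3.6.1.4.1.5951.4.1.3.1   # assumed OID of the vserver table (NS-ROOT-MIB); placeholder
  version: 2
  max_repetitions: 25          # rows requested per GETBULK; tried both higher and lower
  retries: 3                   # matches "after 3 retries" in the error above
  timeout: 30s                 # illustrative value; larger values were tried as well
  auth:
    community: public          # placeholder community string
```

Note that the Prometheus side has its own `scrape_timeout` for the job (10s by default), so a larger exporter `timeout` only helps if the scrape timeout is raised to match.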
I cannot see any reference to setting of snmp timeouts or rate limiting on the netscaler.
Can anyone help me narrow down if this is an snmp exporter issue or a netscaler issue?
Thanks.