snmp exporter periodic timeouts when walking citrix netscaler

Justin Teare

Jun 7, 2020, 7:39:20 PM

to Prometheus Users

Hi all, I have been running into some strange snmp walk timeout issues with snmp exporter against citrix netscaler appliances.

Running latest (0.18.0) snmp exporter as a docker container.

If I try to walk the "vServer" or other similar metrics which have a time series for each vserver (as opposed to e.g. netscaler appliance cpu metrics), the walks fail due to timeouts in a bizarrely periodic way. We currently have around 420 vservers on each load balancer.

Behaviour

The snmp exporter will fail to walk the netscaler at approx 15 mins past the hour every hour, and will not walk again correctly for 15-20 mins. I am walking 2 netscalers, and the scrapes fail on both netscalers at the same time. One resumes walking after about 15 mins, while the other takes about 25 min to resume walking. Image shows "snmp_scrape_duration_seconds" for the netscaler module from the Prometheus interface.

The problem is not with Prometheus as you can observe the timeouts when targeting the netscaler from the SNMP exporter web interface which reports the following error:

An error has occurred while serving metrics:

error collecting metric Desc{fqName: "snmp_error", help: "Error scraping target", constLabels: {}, variableLabels: []}: error walking target example.com: Request timeout (after 3 retries)

The logs for the snmp exporter container show this error:

level=info ts=2020-06-07T23:28:20.946Z caller=collector.go:224 module=citrix_adc target=example.com msg="Error scraping target" err="scrape canceled (possible timeout) walking target example.com"

A few days ago I was using snmp exporter version 0.17.0 and the error was more along the lines of `context canceled`. I realise there were some updates to timeouts made in the latest update but that doesn't seem to be helping in this situation (see more info about my timeout settings further below).

No noticeable problems are happening from the netscaler's perspective. These are production appliances and everything is running fine.

I am not sure if this is an snmp exporter related problem or a netscaler related problem.

I have done testing from the command line to confirm that snmp on the netscaler is still responding. The command takes longer than it does during the 'non-timeout' period, but it does not time out or fail. The fact that I can run `snmpbulkwalk` on the entire `vserver` table from my command line and get no timeout error during the same period makes me think it's snmp exporter related, whereas the fact that it happens on a regular periodic cycle makes me think it could be something that's happening on the netscaler.

If I generate a new minimal snmp.conf during the 'timeout period' with the vserver related OIDs removed and just leave e.g. netscaler cpu stats, the walks will resume straight away.

When I time `snmpbulkwalk` on the vserver table from the command line (using the linux `time` command), it normally takes about 3s to run. During the weird hourly 'timeout' period it takes about 6 seconds.
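For anyone wanting to reproduce the timing comparison over a full hour rather than with one-off `time` runs, here is a minimal Python sketch. It only wraps an arbitrary command and logs how long each run takes. The function names are made up for illustration, and the actual `snmpbulkwalk` arguments (SNMP version, community string, host, table OID) are left to the reader since they are not given in this thread.

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd once, discarding its output; return (elapsed_seconds, return_code)."""
    start = time.monotonic()
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.monotonic() - start, result.returncode


def log_walk_durations(cmd, samples, interval_seconds):
    """Time cmd `samples` times, sleeping between runs; return the logged rows."""
    rows = []
    for i in range(samples):
        elapsed, code = time_command(cmd)
        stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        rows.append((stamp, elapsed, code))
        # One line per run so slow periods can be lined up against the clock.
        print(f"{stamp} elapsed={elapsed:.2f}s rc={code}")
        if i < samples - 1:
            time.sleep(interval_seconds)
    return rows
```

Calling `log_walk_durations(cmd, samples=120, interval_seconds=60)` with `cmd` set to the same `snmpbulkwalk` argument list used on the command line would cover two hourly cycles, which should be enough to see whether the slow period lines up with the exporter failures.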

Changing my `timeout` or `max_repetitions` does not seem to have any effect: I have tried setting the timeout above 30s, and both increasing and decreasing `max_repetitions`, and the walk still fails. The snmp exporter fails to walk one column of a table, while I can walk the entire table with no failure from the command line.
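As a rough aid for reasoning about `max_repetitions`, the sketch below estimates how many GETBULK round trips a walk of one column needs for around 420 vservers. It is a simplification under stated assumptions: every response is assumed to carry exactly `max_repetitions` varbinds, and the walk is assumed to stop at the first OID outside the column. Real agents may return fewer varbinds per response, so treat the numbers as a lower bound.

```python
def estimated_getbulk_requests(rows, max_repetitions):
    """Estimate GETBULK round trips needed to walk one column of `rows` entries.

    Assumption: each response holds exactly `max_repetitions` varbinds, and the
    walk ends when the first out-of-column OID is seen, so rows + 1 varbinds
    must be fetched in total.
    """
    if rows < 0 or max_repetitions < 1:
        raise ValueError("rows must be >= 0 and max_repetitions >= 1")
    # Integer form of ceil((rows + 1) / max_repetitions).
    return (rows + max_repetitions) // max_repetitions


for reps in (10, 25, 50):
    print(reps, estimated_getbulk_requests(420, reps))
# prints:
# 10 43
# 25 17
# 50 9
```

Each of those round trips is subject to the per-request timeout, so a single slow response anywhere in the sequence is enough to fail the whole walk.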

I cannot see any reference to setting of snmp timeouts or rate limiting on the netscaler.

Can anyone help me narrow down if this is an snmp exporter issue or a netscaler issue?

Thanks.

Ben Kochie

Jun 8, 2020, 1:15:01 AM

to Justin Teare, Prometheus Users

What is your scrape interval and scrape timeout on the Prometheus side? Prometheus sends a default scrape timeout of 10s to the exporter. The exporter timeout is only used if the timeout from the Prometheus server is longer.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/2740b34d-8ae3-4733-9946-740a8f0f9288o%40googlegroups.com.

Justin Teare

Jun 8, 2020, 6:17:26 PM

to Prometheus Users

Thanks Ben,

That's good info to know. It looks like the scrape timeout is not set on that scrape job config. However, as I said, the walk also fails when I query the target directly via the snmp exporter web interface, so it's timing out based on its own timeout settings. I verified this by timing how long it takes to go through the default 3 retries, which changes depending on the timeout I set in the snmp generator config.
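To make the interaction between the two timeouts concrete, here is a small Python sketch of which limit would be hit first. It is a simplified model, not the exporter's actual logic: it assumes that `retries` extra attempts follow the first request, that every attempt waits the full per-request timeout, and that the Prometheus scrape timeout (10s by default, as mentioned above) cancels the walk as soon as it expires. The function names are illustrative only.

```python
def worst_case_request_seconds(timeout_seconds, retries):
    """Longest wait before a single SNMP request is declared failed.

    Assumption: the first attempt plus `retries` retries, each waiting the
    full timeout, with no backoff.
    """
    return timeout_seconds * (retries + 1)


def first_limit_hit(timeout_seconds, retries, scrape_timeout_seconds=None):
    """Return which limit ends an unresponsive walk first.

    `scrape_timeout_seconds=None` models a request made from the exporter web
    interface, where no Prometheus scrape timeout applies.
    """
    budget = worst_case_request_seconds(timeout_seconds, retries)
    if scrape_timeout_seconds is not None and scrape_timeout_seconds < budget:
        return "scrape timeout"
    return "exporter retries"


print(worst_case_request_seconds(30, 3))                   # 120
print(first_limit_hit(30, 3, scrape_timeout_seconds=10))   # scrape timeout
print(first_limit_hit(30, 3))                              # exporter retries
```

Under these assumptions, raising the module timeout only changes the outcome for scrapes coming from Prometheus if the scrape timeout on the job is raised as well.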
