snmp_exporter returning errors for some nodes with same config but others work


gth...@gmail.com

Oct 25, 2018, 11:02:00 AM
to Prometheus Users
Just when I think I have it figured out. I expect to be banned from here soon. With the help of several individuals here I managed to get Prometheus working for several UPSes I am testing against. I picked six nodes to test and thought everything was working correctly. While running queries I realized that one was missing. I am running the snmp_exporter in debug mode for now, since I am still testing, and saw no errors; the output showed the suspect node being successfully walked like all the other working nodes. I then checked the Status > Targets page and it showed this node was throwing a 'server returned HTTP status 500 Internal Server Error'.

Trying to narrow down the issue, I picked three nodes that all have the same configuration. I have triple-checked that the community string is correct on all three. I have even run an snmpwalk against each node, and on every one I get the correct, expected results:

snmpwalk -v2c -m +PowerNet-MIB -c mycommunitystring 192.168.16.22 1.3.6.1.4.1.318.1.1.1.1.1.2
PowerNet-MIB::upsBasicIdentName.0 = STRING: "UPS_IDEN"

I then placed those three nodes as targets in my prometheus.yml (a sketch of the relevant job follows the log below) and started the exporter. The debug output showed that every node was being walked:

time="2018-10-24T16:01:42-05:00" level=debug msg="Scraping target '192.168.16.29' with module 'apcups'" source="main.go:86"
time="2018-10-24T16:01:42-05:00" level=debug msg="Getting 3 OIDs from target \"192.168.16.29\"" source="collector.go:103"
time="2018-10-24T16:01:42-05:00" level=debug msg="Get of 3 OIDs completed in 34.807046ms" source="collector.go:109"
time="2018-10-24T16:01:42-05:00" level=debug msg="Walking target \"192.168.16.29\" subtree \"1.3.6.1.2.1.2\"" source="collector.go:133"
time="2018-10-24T16:01:42-05:00" level=debug msg="Walk of target \"192.168.16.29\" subtree \"1.3.6.1.2.1.2\" completed in 544.912645ms" source="collector.go:143"
time="2018-10-24T16:01:42-05:00" level=debug msg="Walking target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.12\"" source="collector.go:133"
time="2018-10-24T16:01:49-05:00" level=debug msg="Scraping target '192.168.16.22' with module 'apcups'" source="main.go:86"
time="2018-10-24T16:01:49-05:00" level=debug msg="Getting 3 OIDs from target \"192.168.16.22\"" source="collector.go:103"
time="2018-10-24T16:01:49-05:00" level=debug msg="Get of 3 OIDs completed in 100.669286ms" source="collector.go:109"
time="2018-10-24T16:01:49-05:00" level=debug msg="Walking target \"192.168.16.22\" subtree \"1.3.6.1.2.1.2\"" source="collector.go:133"
time="2018-10-24T16:01:50-05:00" level=debug msg="Walk of target \"192.168.16.22\" subtree \"1.3.6.1.2.1.2\" completed in 506.247039ms" source="collector.go:143"
time="2018-10-24T16:01:50-05:00" level=debug msg="Walking target \"192.168.16.22\" subtree \"1.3.6.1.4.1.318.1.1.1.12\"" source="collector.go:133"
time="2018-10-24T16:01:50-05:00" level=debug msg="Walk of target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.12\" completed in 7.428172487s" source="collector.go:143"
time="2018-10-24T16:01:50-05:00" level=debug msg="Walking target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.2\"" source="collector.go:133"
time="2018-10-24T16:01:50-05:00" level=debug msg="Walk of target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.2\" completed in 423.601772ms" source="collector.go:143"
time="2018-10-24T16:01:50-05:00" level=debug msg="Walking target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.3\"" source="collector.go:133"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walk of target \"192.168.16.22\" subtree \"1.3.6.1.4.1.318.1.1.1.12\" completed in 818.196309ms" source="collector.go:143"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walking target \"192.168.16.22\" subtree \"1.3.6.1.4.1.318.1.1.1.2\"" source="collector.go:133"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walk of target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.3\" completed in 470.443788ms" source="collector.go:143"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walking target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.4\"" source="collector.go:133"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walk of target \"192.168.16.22\" subtree \"1.3.6.1.4.1.318.1.1.1.2\" completed in 317.104656ms" source="collector.go:143"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walking target \"192.168.16.22\" subtree \"1.3.6.1.4.1.318.1.1.1.3\"" source="collector.go:133"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walk of target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.4\" completed in 368.865275ms" source="collector.go:143"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walking target \"192.168.16.29\" subtree \"1.3.6.1.4.1.318.1.1.1.7.2\"" source="collector.go:133"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walk of target \"192.168.16.22\" subtree \"1.3.6.1.4.1.318.1.1.1.3\" completed in 383.5979ms" source="collector.go:143"
time="2018-10-24T16:01:51-05:00" level=debug msg="Walking target \"192.168.16.22\" subtree \"1.3.6.1.4.1.318.1.1.1.4\"" source="collector.go:133"
time="2018-10-24T16:01:52-05:00" level=debug msg="Scraping target '192.168.16.33' with module 'apcups'" source="main.go:86"
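
For reference, the relevant job in my prometheus.yml is essentially the standard snmp_exporter example config, roughly like this (module name, targets, and exporter address as in the log and the targets page below):

  - job_name: 'snmp'
    metrics_path: /snmp
    params:
      module: [apcups]
    static_configs:
      - targets:
        - 192.168.16.22
        - 192.168.16.29
        - 192.168.16.33
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116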



Then I checked the Targets page and saw even more issues:

snmp (1/3 up) 


Endpoint: http://127.0.0.1:9116/snmp  (module="apcups", target="192.168.16.22")
State: DOWN   Labels: instance="192.168.16.22"   Last Scrape: 15.566s ago
Error: server returned HTTP status 500 Internal Server Error

Endpoint: http://127.0.0.1:9116/snmp  (module="apcups", target="192.168.16.29")
State: DOWN   Labels: instance="192.168.16.29"   Last Scrape: 22.81s ago
Error: context deadline exceeded

Endpoint: http://127.0.0.1:9116/snmp  (module="apcups", target="192.168.16.33")
State: UP     Labels: instance="192.168.16.33"   Last Scrape: 13.169s ago
Error: (none)

So one is working and the other two seem to be experiencing different issues. I am confused why the walk works fine from the command line. Don't the CLI tools and the exporter both use the same net-snmp libraries? I know SNMP does not return much information for troubleshooting, so I am not sure what to try next. The age of these nodes does differ, but they are all on the same firmware revision.

As usual, I appreciate everyone's time on this and hope this is the last time I bug you (I doubt it).

Ben Kochie

Oct 25, 2018, 11:14:50 AM
to Glen Huey, promethe...@googlegroups.com
You're not going to get banned asking SNMP questions. ;-)

No, the snmp_exporter doesn't use net-snmp at all. It has a completely different implementation of the SNMP protocol. The only use of net-snmp is for parsing the MIBs in the generator.

The 500 error is coming from the exporter; it's likely `curl -v` would show you the same code.

The "context deadline exceeded" means it took too long to scrape. The default scrape_timeout is 10s.

I would look at `scrape_duration_seconds` to see how close you are to the walk limit.


gth...@gmail.com

Oct 25, 2018, 12:17:25 PM
to Prometheus Users
Thanks Ben. I am afraid I am unsure how to expose the scrape_duration_seconds metric. When I attempt to curl what I believe is the correct URL, I run into issues:


curl -v http://myserver:9116/snmp?module=apcups&target=192.168.16.33

[root@myserver snmp_exporter]# * About to connect() to myserver port 9116 (#0)
*   Trying fe80::95f6:b70:e85c:ead3...
* Connected to myserver (fe80::95f6:b70:e85c:ead3) port 9116 (#0)
> GET /snmp?module=apcups HTTP/1.1
> User-Agent: curl/7.29.0
> Host: myserver:9116
> Accept: */*
>
< HTTP/1.1 400 Bad Request
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 25 Oct 2018 16:12:09 GMT
< Content-Length: 37
<
'target' parameter must be specified
* Connection #0 to host myserver left intact

If I place the target variable first I get further, but it returns a timeout even on the IP that works, so I believe it is something incorrect with my syntax (see the note after the output):

[root@myserver snmp_exporter]# * About to connect() to myserver port 9116 (#0)
*   Trying fe80::95f6:b70:e85c:ead3...
* Connected to myserver (fe80::95f6:b70:e85c:ead3) port 9116 (#0)
> GET /snmp?target=192.168.16.29 HTTP/1.1
> User-Agent: curl/7.29.0
> Host: myserver:9116
> Accept: */*
>
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< X-Content-Type-Options: nosniff
< Date: Thu, 25 Oct 2018 16:15:33 GMT
< Content-Length: 239
<
An error has occurred during metrics gathering:

error collecting metric Desc{fqName: "snmp_error", help: "Error scraping target", constLabels: {}, variableLabels: []}: Error getting target 192.168.16.29: Request timeout (after 3 retries)
* Connection #0 to host myserver left intact
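
A note on the curl syntax: I think the shell was eating part of the URL. The unquoted & sends the command to the background, so only the first parameter actually reaches the exporter, which matches the "'target' parameter must be specified" response above. Quoting the URL should pass both parameters, something like:

curl -v 'http://myserver:9116/snmp?module=apcups&target=192.168.16.33'

With the quoted form, both module and target should show up in the GET line of the verbose output.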

Ben Kochie

Oct 25, 2018, 12:56:35 PM
to Glen Huey, promethe...@googlegroups.com
scrape_duration_seconds is an automatically created Prometheus metric for every target. You should be able to just query for it in the Prometheus UI.

There are several of these built-in automatic metrics for tracking the health of Prometheus scraping.


gth...@gmail.com

Oct 25, 2018, 4:43:45 PM
to Prometheus Users
Ahh... well, that makes things easier. Unfortunately, there are no metrics for the two nodes that are experiencing issues. Is there anything else I can try to narrow down the issues? The timings in the debug output for the one that works do not seem that far off from the two that do not.

Ben Kochie

Oct 26, 2018, 6:08:25 AM
to Glen Huey, promethe...@googlegroups.com
All targets will have those metrics. They're generated on the server side.

gth...@gmail.com

Oct 26, 2018, 9:46:11 AM
to Prometheus Users
Just to clarify, the snmp_exporter is running on the same host that Prometheus runs on. I don't see results for the two problematic nodes, however:

snmp1.PNG


Here is what I get with either of the problem nodes:

snmp2.PNG

Ben Kochie

Oct 27, 2018, 3:27:55 AM
to Glen Huey, promethe...@googlegroups.com
Remove the snmp_ prefix; that prefix is specific to the exporter's own metrics. It's just scrape_duration_seconds.
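
For example, with the job name shown on your targets page, the query is just something like:

scrape_duration_seconds{job="snmp"}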

gth...@gmail.com

Oct 29, 2018, 11:00:00 AM
to Prometheus Users
Looks like you were dead on:

Element                                                                   Value
scrape_duration_seconds{instance="192.168.16.18",job="snmp"}              9.264275752
scrape_duration_seconds{instance="192.168.16.20",job="snmp"}              8.267845698
scrape_duration_seconds{instance="192.168.16.22",job="snmp"}              10.000491527
scrape_duration_seconds{instance="192.168.16.26",job="snmp"}              8.57521429
scrape_duration_seconds{instance="192.168.16.29",job="snmp"}              10.000354781
scrape_duration_seconds{instance="192.168.16.33",job="snmp"}              2.745934816
scrape_duration_seconds{instance="192.168.16.6",job="snmp"}               9.571652629999999
scrape_duration_seconds{instance="192.168.16.7",job="snmp"}               9.62274709
scrape_duration_seconds{instance="192.168.16.9",job="snmp"}               10.000474457
scrape_duration_seconds{instance="localhost:9090",job="prometheus"}       0.005037188
scrape_duration_seconds{instance="localhost:9100",job="node_exporter"}    0.000745088

I see there is a scrape_timeout that can be defined at the job level in prometheus.yml. Would this have any effect, or is there a timeout variable that can be defined at the snmp_exporter level? Assuming the default of 10s can be adjusted at all. Thanks again for the help; I believe I am almost there.

Ben Kochie

Oct 29, 2018, 11:33:12 AM
to Glen Huey, promethe...@googlegroups.com
The snmp_exporter has its own timeout controls; the default is 20 seconds per walk, not per scrape (I think), so it should be fine. You can adjust the scrape_timeout in your prometheus.yml just fine.

There's an open issue to make this more automatic[0].


gth...@gmail.com

Oct 29, 2018, 2:53:54 PM
to Prometheus Users
OK Ben, I set the scrape interval to 60s and the scrape timeout to 30s, and this resolved the issue for the node with the "context deadline exceeded" problem (the change is sketched below). Unfortunately, one node is still returning the 'server returned HTTP status 500 Internal Server Error'. Would you like me to make a separate thread for that issue?

Some solid progress. Thanks again for the info.
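
In case it helps anyone else, the change boils down to roughly this in the snmp job (values as above; the rest of the job is unchanged):

  - job_name: 'snmp'
    scrape_interval: 60s
    scrape_timeout: 30s
    # ...rest of the snmp job as before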

yiyufeng1986

Feb 14, 2019, 1:38:58 AM
to gth...@gmail.com, Prometheus Users


Sent from my iPad

On Oct 26, 2018, at 00:17, gth...@gmail.com wrote:

Justin Counsell

Oct 14, 2020, 10:21:22 PM
to Prometheus Users
Any chance you can link your prometheus.yml?