snmp_exporter inaccurate metrics

Betts Wang

Oct 3, 2022, 11:25:23 PM
to Prometheus Users
Hi guys, I have an issue with snmp_exporter. I found that the value queried by snmpbulkwalk (or snmpbulkget) is different from the value queried by snmp_exporter, and the value queried by snmp_exporter didn't change within 1 minute.

Here are the observations; the tests were run at almost the same time. But the difference is huge: 66279141459 octets × 8 bits ~= 530 Gbit, while the interface is only 100 Gbit.

queried by snmp_exporter
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.081891198125584e+15

queried by snmpbulkwalk
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4081957477267043
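
(Checking the gap between those two readings with shell arithmetic, just to show how the ~530 Gbit figure above was obtained:)

echo $(( 4081957477267043 - 4081891198125584 ))        # 66279141459 octets
echo $(( (4081957477267043 - 4081891198125584) * 8 ))  # 530233131672 bits ~= 530 Gbit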

root@prometheus:/usr/local/snmp_exporter# snmpbulkwalk -V
NET-SNMP version: 5.8

The snmp_exporter version is the latest (0.20.0 / 2021-02-12).

Betts Wang

Oct 3, 2022, 11:27:37 PM
to Prometheus Users
The snmp_exporter config:
root@prometheus:/usr/local/snmp_exporter# cat /usr/local/snmp_exporter/snmp.yml
# WARNING: This file was auto-generated using snmp_exporter generator, manual changes will be lost.
router_switch:
  walk:
  - 1.3.6.1.2.1.31.1.1.1.1
  - 1.3.6.1.2.1.31.1.1.1.10
  - 1.3.6.1.2.1.31.1.1.1.6
  - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.11
  - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.5
  - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.7
  - 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.5
  - 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.7
  - 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.11
  get:
  - 1.3.6.1.2.1.1.3.0
  - 1.3.6.1.2.1.1.5.0
  metrics:
  - name: sysUpTime
    oid: 1.3.6.1.2.1.1.3
    type: gauge
    help: The time (in hundredths of a second) since the network management portion
      of the system was last re-initialized. - 1.3.6.1.2.1.1.3
  - name: sysName
    oid: 1.3.6.1.2.1.1.5
    type: DisplayString
    help: An administratively-assigned name for this managed node - 1.3.6.1.2.1.1.5
  - name: ifHCOutOctets
    oid: 1.3.6.1.2.1.31.1.1.1.10
    type: counter
    help: The total number of octets transmitted out of the interface, including framing
      characters - 1.3.6.1.2.1.31.1.1.1.10
    indexes:
    - labelname: ifIndex
      type: gauge
    lookups:
    - labels:
      - ifIndex
      labelname: ifName
      oid: 1.3.6.1.2.1.31.1.1.1.1
      type: DisplayString
  - name: ifHCInOctets
    oid: 1.3.6.1.2.1.31.1.1.1.6
    type: counter
    help: The total number of octets received on the interface, including framing
      characters - 1.3.6.1.2.1.31.1.1.1.6
    indexes:
    - labelname: ifIndex
      type: gauge
    lookups:
    - labels:
      - ifIndex
      labelname: ifName
      oid: 1.3.6.1.2.1.31.1.1.1.1
      type: DisplayString
  - name: S5735CPUUsage
    oid: 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.5
    type: gauge
    help: This object indicates the entity CPU Usage - 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.5.67108873
    indexes:
    - labelname: entPhysicalIndex
      type: gauge
  - name: S5735MEMUsage
    oid: 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.7
    type: gauge
    help: This object indicates the entity MEM Usage - 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.7.671008873
    indexes:
    - labelname: entPhysicalIndex
      type: gauge
  - name: S5735Temperature
    oid: 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.11
    type: gauge
    help: This object indicates the entity temperature - 1.3.6.1.4.1.56813.5.25.31.1.1.1.1.11.671008873
    indexes:
    - labelname: entPhysicalIndex
      type: gauge
  - name: hwEntityTemperature
    oid: 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.11
    type: gauge
    help: This object indicates the entity temperature - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.11
    indexes:
    - labelname: entPhysicalIndex
      type: gauge
  - name: hwEntityCpuUsage
    oid: 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.5
    type: gauge
    help: This object indicates the CPU usage of an entity - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.5
    indexes:
    - labelname: entPhysicalIndex
      type: gauge
  - name: hwEntityMemUsage
    oid: 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.7
    type: gauge
    help: This object indicates the memory usage of an entity, that is, the percentage
      of the memory that has been used. - 1.3.6.1.4.1.2011.5.25.31.1.1.1.1.7
    indexes:
    - labelname: entPhysicalIndex
      type: gauge
  version: 3
  max_repetitions: 25
  retries: 3
  timeout: 5s
  auth:
    security_level: authPriv
    username: *****
    password: *****
    auth_protocol: MD5
    priv_protocol: AES
    priv_password: *****

Betts Wang

Oct 3, 2022, 11:32:48 PM
to Prometheus Users
The Prometheus data is below. The scrape_interval is set to 10s.

[attachment: 1.jpg (screenshot of the Prometheus graph data)]

- job_name: 'outside-device'
  scrape_interval: 10s
  scrape_timeout: 5s
  static_configs:
  - targets:
    - 36.155.143.1
  metrics_path: /snmp
  params:
    module: [router_switch]
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116

Ben Kochie

Oct 3, 2022, 11:35:36 PM
to Betts Wang, Prometheus Users
It would be useful to get a tcpdump packet capture of the device responses for this. Without knowing the actual bytes the device responded with, it's hard to say what the issue is.

Betts Wang

Oct 4, 2022, 1:28:01 AM
to Prometheus Users
I adjusted the snmp config to match only one port of one device and ran tcpdump. The result is below, but I could not read it.

root@prometheus:~/generator/generator# tcpdump -nn -i enp6s18 udp port 161 and  host 36.155.143.1
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on enp6s18, link-type EN10MB (Ethernet), capture size 262144 bytes
13:13:03.412291 IP 10.0.1.150.51871 > 36.155.143.1.161:  F=r U="" E= C="" GetRequest(14)
13:13:03.413662 IP 36.155.143.1.161 > 10.0.1.150.51871:  F= U="" E=_80_00_07_db_03_3c_c7_86_19_64_81 C="" Report(30)  .1.3.6.1.6.3.15.1.1.4.0=54131
13:13:03.413796 IP 10.0.1.150.51871 > 36.155.143.1.161:  F=apr U="cpucloud" [!scoped PDU]9b_2f_6d_e6_6e_9b_73_f7_91_99_c9_bd_e6_00_b2_15_42_77_d7_bf_1b_ad_54_31_38_83_0d_0c_7e_c1_c9_0f_10_29_21_ea_34_aa_45_2a_59_15_c2_fb_ee_c1_4e_c2_2b_bc_d7_10_c2_31_f3_33_1e_60_41_95_fa_b6_b2_4b_35_7b_78
13:13:03.416033 IP 36.155.143.1.161 > 10.0.1.150.51871:  F=ap U="cpucloud" [!scoped PDU]be_a0_3f_09_b0_ee_c5_b4_97_c4_96_ec_11_46_02_76_8c_e5_90_03_2a_f9_d2_09_f8_8e_30_85_16_38_37_7f_a2_ec_06_40_9f_e2_e6_8a_eb_7c_dc_38_9e_61_08_03_0e_bd_f7_47_4f_73_ae_94_16_c4_30_dc_a1_59_fe_d8_d2_94_fa_c0_eb_3d_2c_bb_09_73_7a_f8_7b_af_2f_64_27_8e_09_51
13:13:05.176831 IP 10.0.1.150.36230 > 36.155.143.1.161:  F=r U="" E= C="" GetRequest(14)
13:13:05.178156 IP 36.155.143.1.161 > 10.0.1.150.36230:  F= U="" E=_80_00_07_db_03_3c_c7_86_19_64_81 C="" Report(30)  .1.3.6.1.6.3.15.1.1.4.0=54132
13:13:05.178297 IP 10.0.1.150.36230 > 36.155.143.1.161:  F=apr U="cpucloud" [!scoped PDU]44_45_71_67_db_38_a3_43_63_d7_05_08_ae_fb_f2_19_c3_20_2a_84_6b_07_e8_3a_97_6d_e4_88_d8_0a_4e_d5_02_d5_9b_3c_6c_59_8d_23_21_95_bb_15_96_1a_d4_28_98_61_10_d5_aa_38_80_9d_30_66_71_03_a6_aa_60_fa_81_ae_3d
13:13:05.180310 IP 36.155.143.1.161 > 10.0.1.150.36230:  F=ap U="cpucloud" [!scoped PDU]8c_ae_ea_01_12_6c_54_b7_52_04_7f_10_29_14_be_6c_fd_a3_e2_a0_b3_cf_e3_61_a5_72_f3_5a_da_04_18_ba_c6_ec_a6_b5_a5_cb_cc_0d_98_07_25_7e_2e_26_79_1a_3e_72_00_33_42_d7_7a_ef_79_d6_57_5c_cd_9f_59_e4_e5_ce_79_f6_69_66_b0_84_d7_80_d6_08_64_c9_4f_9a_8e_e5_03_79
13:13:15.177469 IP 10.0.1.150.60887 > 36.155.143.1.161:  F=r U="" E= C="" GetRequest(14)
13:13:15.178844 IP 36.155.143.1.161 > 10.0.1.150.60887:  F= U="" E=_80_00_07_db_03_3c_c7_86_19_64_81 C="" Report(30)  .1.3.6.1.6.3.15.1.1.4.0=54133
13:13:15.179098 IP 10.0.1.150.60887 > 36.155.143.1.161:  F=apr U="cpucloud" [!scoped PDU]85_63_35_03_4c_07_ec_92_7a_7f_f1_ad_4b_da_de_f0_72_db_3a_72_0e_6c_53_e6_10_c4_2f_c0_52_52_c4_41_45_14_89_95_78_61_e1_9e_f4_03_49_e8_f9_80_12_41_1d_16_25_97_cf_3d_23_76_52_48_eb_c5_98_53_1a_b5_e6_a6_0d
13:13:15.181289 IP 36.155.143.1.161 > 10.0.1.150.60887:  F=ap U="cpucloud" [!scoped PDU]9c_2c_4f_e9_78_7b_ef_2f_68_4c_de_3b_b5_53_00_69_1d_7e_2a_e3_96_92_87_59_0d_9c_b2_df_bf_f9_2c_e4_5e_d7_a9_57_b6_9e_22_19_64_ce_59_7c_61_f7_21_1b_84_ca_7a_45_8f_ec_9b_9f_18_0c_a7_eb_80_a7_f0_bd_8f_23_4d_8d_21_ad_5f_07_29_da_e5_75_3c_5d_d6_f8_b0_dc_48_51

Betts Wang

Oct 4, 2022, 1:37:59 AM
to Prometheus Users
It seems the timestamps are right: one response every 10s. So what goes wrong when I activate the other parts of the config? Here is my Prometheus prometheus.yml:

root@prometheus:~/generator/generator# cat /etc/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
           - localhost:9093
rule_files:
  - "/etc/prometheus/*.rules"

scrape_configs:

....... here are some server target

#- job_name: 'data_switch'
#  scrape_interval: 10s
#  scrape_timeout: 5s
#  static_configs:
#  - targets:
#    - 172.16.0.129
#    - 172.16.0.130
#    - 172.16.0.131
#    - 172.16.0.132
#    - 172.16.0.133
#    - 172.16.0.134
#    - 172.16.0.135
#  metrics_path: /snmp
#  params:
#    module: [router_switch]
#  relabel_configs:
#      - source_labels: [__address__]
#        target_label: __param_target
#      - source_labels: [__param_target]
#        target_label: instance
#      - target_label: __address__
#        replacement: 127.0.0.1:9116
#- job_name: 'mngt_switch'
#  scrape_interval: 10s
#  scrape_timeout: 5s
#  static_configs:
#  - targets:
#    - 172.16.0.161
#    - 172.16.0.162
#    - 172.16.0.163
#    - 172.16.0.164
#    - 172.16.0.165
#    - 172.16.0.166
#    - 172.16.0.167
#  metrics_path: /snmp
#  params:
#    module: [router_switch]
#  relabel_configs:
#      - source_labels: [__address__]
#        target_label: __param_target
#      - source_labels: [__param_target]
#        target_label: instance
#      - target_label: __address__
#        replacement: 127.0.0.1:9116
#- job_name: 'ipmi_switch'
#  scrape_interval: 10s
#  scrape_timeout: 5s
#  static_configs:
#  - targets:
#    - 172.16.0.193
#    - 172.16.0.194
#    - 172.16.0.195
#    - 172.16.0.196
#    - 172.16.0.197
#    - 172.16.0.198
#    - 172.16.0.199
#  metrics_path: /snmp
#  params:
#    module: [router_switch]
#  relabel_configs:
#      - source_labels: [__address__]
#        target_label: __param_target
#      - source_labels: [__param_target]
#        target_label: instance
#      - target_label: __address__
#        replacement: 127.0.0.1:9116

- job_name: 'outside-device'
  scrape_interval: 10s
  scrape_timeout: 5s
  static_configs:
  - targets:
    - 36.155.143.1
#    - 172.16.0.21
#    - 172.16.0.22

  metrics_path: /snmp
  params:
    module: [router_switch]
  relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9116
root@prometheus:~/generator/generator#

Brian Candler

Oct 4, 2022, 5:26:06 AM
to Prometheus Users
On Tuesday, 4 October 2022 at 04:25:23 UTC+1 bett...@gmail.com wrote:
queried by snmp_exporter
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.081891198125584e+15

queried by snmpbulkwalk
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4081957477267043
 
Odd. Is the snmp_exporter output that you show viewed directly using "curl" or similar, without having gone via Prometheus or any other software first?

In hex those are:

000e 8075 5ae0 4610  # snmp_exporter
000e 8084 c96b b663  # snmpbulkwalk

respectively. At first I thought of #350, but it's not that. I then wondered if these metrics are subject to some intermediate processing as a float32.

No: this would give
4.0819576e+15
e8084d0000000
which is still very close to the correct value.
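
(For reference, those hex values can be reproduced from the decimal counters with printf in a shell:)

printf '%016x\n' 4081891198125584   # 000e80755ae04610  (snmp_exporter)
printf '%016x\n' 4081957477267043   # 000e8084c96bb663  (snmpbulkwalk)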

As others have said: I think a raw tcpdump is required. Add "-s0 -X" to get the full packet decode in hex.  But this won't work if you're using v3 authPriv, obviously.

One other thought: are you using the same version of SNMP for both snmp_exporter and snmpbulkwalk?  If one is using v3 and the other v2c (say), then it's conceivable this could tickle a bug in the device.

I'd also try using authNoPriv instead of authPriv; it's not impossible there's a decryption bug somewhere.

Betts Wang

Oct 4, 2022, 9:38:51 AM
to Prometheus Users
On Tuesday, 4 October 2022 at 17:26:06 UTC+8, Brian Candler wrote:

 Odd. Is the snmp_exporter output that you show viewed directly using "curl" or similar, without having gone via Prometheus or any other software first?

I use the web listening port 9116 of snmp_exporter, and manually run snmpbulkwalk to query the value. Both of them are on the same system.

In hex those are:
000e 8075 5ae0 4610  # snmp_exporter
000e 8084 c96b b663  # snmpbulkwalk

The tcpdump was only for snmp_exporter, after I changed snmp.yml to only one port's OID, so the values it shows are the normal case.



As others have said: I think a raw tcpdump is required. Add "-s0 -X" to get the full packet decode in hex.  But this won't work if you're using v3 authPriv, obviously.

Because the network devices are in the production environment, it's hard to change the authentication.
 

One other thought: are you using the same version of SNMP for both snmp_exporter and snmpbulkwalk?  If one is using v3 and the other v2c (say), then it's conceivable this could tickle a bug in the device.

Yes, I use the same SNMP version for both, only v3.



The strange thing is: when I change snmp.yml to only one port's OID and prometheus.yml to only one target, the Prometheus data is normal and matches the scrape_interval setting (10s). But when I restore the full configs, the Prometheus data is abnormal: there is only one new value per minute.
 

Betts Wang

Oct 4, 2022, 9:46:24 AM
to Prometheus Users
[attachments: 2.jpg, 3.jpg (Prometheus graph screenshots)]

Brian Candler

Oct 5, 2022, 3:44:49 AM
to Prometheus Users
I can see from the graphs that the data appears to go up only once per minute, but that might be because the data from the device itself is only updated once per minute, which would have nothing to do with scraping.

To see whether prometheus is actually scraping between those times, use a range vector query:

1. Select tab "Table" instead of "Graph"
2. Enter query: ifHCInOctets{...labels as before...}[5m]

This will show you timestamped data points between now and 5 minutes ago. Each one of those points is an actual scrape, with the exact timestamp at which it took place. If you see points 1m apart, then the Prometheus job is only scraping at 1m intervals. If you see points 10s apart with identical values, then this is what the exporter is actually returning for each scrape.

The other possibility I can think of is that maybe some of your scrapes are hitting the scrape timeout (5s) and failing.  To check this, do another range vector query:

    up{job="outside-device",instance="36.155.143.1"}[5m]

Does it flip between 1 and 0?  Or is it 1 continuously?  And what's the interval between timestamps?

Visually, you can draw a graph of
    up{job="outside-device",instance="36.155.143.1"}
or
    min_over_time(up{job="outside-device",instance="36.155.143.1"}[1m])
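
If it's easier from the command line, the same range-vector check can also be run against the Prometheus HTTP API (a sketch, assuming Prometheus is listening on the default 127.0.0.1:9090, using the ifHCOutOctets series you showed earlier):

curl -sS -G 'http://127.0.0.1:9090/api/v1/query' \
  --data-urlencode 'query=ifHCOutOctets{instance="36.155.143.1",ifName="100GE1/0/3"}[5m]'

Each sample in the JSON response carries the exact scrape timestamp.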

> Because the network devices are in the production environment, it's hard to change the authentication.

Will the device accept an authNoPriv query anyway?  That is, can you try this?

snmpbulkwalk  -l authNoPriv x.x.x.x ifHCInOctets

If that works, then you can get the tcpdump data you require.  You'll still have to decode the ASN.1 packets by hand though (or find an ASN.1 decoder tool)
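
For example, something along the lines of your earlier capture command, with the extra flags added (a sketch, reusing your interface name and target):

tcpdump -nn -s0 -X -i enp6s18 udp port 161 and host 36.155.143.1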

Just to clarify though: it seems you're seeing two different problems with this device?

1. when using multiple interfaces in your scrape job, ifHCInOctets appears to be incrementing only once every minute instead of once every 10 seconds, even though the scrape job is at 10 second intervals

2. the values returned by snmp_exporter appear to be significantly different to those from snmpbulkwalk

Is that correct?  If so we should probably take these one at a time, but they could be related: e.g. snmp_exporter may be seeing a value that's 1 minute out of date.

Thinks: is the difference between the two values roughly what you'd expect to see over a 1 minute period?  I can check this.  From your first mail:

queried by snmp_exporter
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.081891198125584e+15

queried by snmpbulkwalk
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4081957477267043

=> difference is 66279141459 (~6.6e10)

From the graphs you just posted, showing samples 1 minute apart:

first data point: 1137591073238070
second data point: 1137663709396632

=> difference is 72636158562 (~7.2e10)

Those values are very close, so that looks very likely indeed.

Is there any chance there is some HTTP cache between your Prometheus server and your snmp_exporter? Otherwise, maybe the target itself is doing some rate limiting on SNMP queries from the same client, or it only updates its MIB once per minute.

You could try running snmpbulkwalk every 10 seconds and see how the value changes.  At this point, I'm pretty sure it's not snmp_exporter which is at fault.

HTH,

Brian.

Betts Wang

Oct 6, 2022, 12:06:38 PM
to Prometheus Users
Just to clarify though: it seems you're seeing two different problems with this device?

1. when using multiple interfaces in your scrape job, ifHCInOctets appears to be incrementing only once every minute instead of once every 10 seconds, even though the scrape job is at 10 second intervals

2. the values returned by snmp_exporter appear to be significantly different to those from snmpbulkwalk

Is that correct?  If so we should probably take these one at a time, but they could be related: e.g. snmp_exporter may be seeing a value that's 1 minute out of date.

Yes, those two points are what I want to examine. I am sure the graphs are incorrect, because traffic on interface 100GE1/0/3 always exists; the rate couldn't be zero for a whole minute.
 


Is there any chance there is some HTTP cache between your Prometheus server and your snmp_exporter? Otherwise, maybe the target itself is doing some rate limiting on SNMP queries from the same client, or it only updates its MIB once per minute.

There is no cache server between Prometheus and snmp_exporter, and no rate limiting between them.

You could try running snmpbulkwalk every 10 seconds and see how the value changes.  At this point, I'm pretty sure it's not snmp_exporter which is at fault.

Yes, I did the test below.


The test bash script is shown below.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
#!/bin/bash
echo "" > result.txt
firsttime=$(date "+%Y-%m-%d %H:%M:%S")
echo $firsttime >> result.txt
# print current  ifHCOutOctets value
snmpbulkwalk -v3  -u cpucloud -l authpriv -a md5 -A ***** -x aes128 -X ***** 36.155.143.1 1.3.6.1.2.1.31.1.1.1.10.7 >> result.txt
sleep 10
secondtime=$(date "+%Y-%m-%d %H:%M:%S")
echo $secondtime >> result.txt
# print  ifHCOutOctets value after 10s
snmpbulkwalk -v3  -u cpucloud -l authpriv -a md5 -A ***** -x aes128 -X ***** 36.155.143.1 1.3.6.1.2.1.31.1.1.1.10.7 >> result.txt
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Running the test, snmpbulkwalk shows that the rate of interface 100GE1/0/3 is ~=18.3 Gb/s.
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
root@prometheus:~# bash test.sh
root@prometheus:~# cat result.txt
2022-10-06 23:40:25
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4588160205073820
2022-10-06 23:40:35
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4588183087237996
root@prometheus:~# bc
bc 1.07.1
Copyright 1991-1994, 1997, 1998, 2000, 2004, 2006, 2008, 2012-2017 Free Software Foundation, Inc.
This is free software with ABSOLUTELY NO WARRANTY.
For details type `warranty'.
4588183087237996-4588160205073820
22882164176
22882164176*8/10
18305731340

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Checking my router's interface rate (~=17.3 Gb/s), it is very close to the value from snmpbulkwalk.
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
<JS-NJ-YD-OUT_ESW>dis int 100ge1/0/3 | include Last.10.*out
    Last 10 seconds output rate: 17313342948 bits/sec, 1651109 packets/sec
    Last 10 seconds output utility rate: 17.31%
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Finally, I compared the value from snmpbulkwalk with the value in Prometheus (scraped from snmp_exporter). The difference is huge: (4588160205073820 - 4588095830969249) * 8 / 10 ~= 51 Gb/s.
[attachment: 12.jpg (screenshot comparing the snmpbulkwalk and Prometheus values)]


 

Betts Wang

Oct 6, 2022, 8:28:50 PM
to Prometheus Users
An additional note about the snmp_exporter scrape time:
------------------------------------------------------------------------------------------

[attachment: 6.jpg (screenshot of the snmp_exporter scrape times)]

Brian Candler

Oct 7, 2022, 4:04:23 AM
to Prometheus Users
Can you run your script so that it does 6 calls to snmpbulkwalk, at 10 second intervals?

Betts Wang

Oct 7, 2022, 9:35:19 AM
to Prometheus Users
what does "6 calls" mean?

Brian Candler

Oct 7, 2022, 1:30:49 PM
to Prometheus Users
I just meant to run it 6 times:

snmpbulkwalk ...
sleep 10
snmpbulkwalk ...
sleep 10
snmpbulkwalk ...
sleep 10
snmpbulkwalk ...
sleep 10
snmpbulkwalk ...
sleep 10
snmpbulkwalk ...

Check if the result increments *every* time.

It seems there is something which is causing the value to "stick" for about a minute. This test is to be sure it's not happening all the time, and only happens when snmp_exporter is polling.
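
As a loop, reusing the snmpbulkwalk arguments from your test.sh (a sketch):

for i in 1 2 3 4 5 6; do
  date "+%Y-%m-%d %H:%M:%S"
  snmpbulkwalk -v3 -u cpucloud -l authpriv -a md5 -A ***** -x aes128 -X ***** 36.155.143.1 1.3.6.1.2.1.31.1.1.1.10.7
  sleep 10
done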

Brian Candler

Oct 7, 2022, 1:51:49 PM
to Prometheus Users
On Tuesday, 4 October 2022 at 14:38:51 UTC+1 bett...@gmail.com wrote:
On Tuesday, 4 October 2022 at 17:26:06 UTC+8, Brian Candler wrote:

 Odd. Is the snmp_exporter output that you show viewed directly using "curl" or similar, without having gone via Prometheus or any other software first?

I use the web listening port 9116 of snmp_exporter, and manually run snmpbulkwalk to query the value.

I'm still not 100% sure what you mean by "I use the web listening port 9116 of snmp_exporter".

Can you show the exact command you used to query snmp_exporter?

Betts Wang

Oct 7, 2022, 9:50:27 PM
to Prometheus Users
I'm still not 100% sure what you mean by "I use the web listening port 9116 of snmp_exporter".

That means I just view localhost:9116, which snmp_exporter is listening on. It doesn't really matter for this issue. In the last test, I compared the value queried by snmpbulkwalk with the Prometheus value at the same time.

Check if the result increments *every* time.

Yes, it increases every time, no matter how long the test runs.


root@prometheus:~# tail -f result.txt
2022-10-08 09:46:30
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789324209167469
2022-10-08 09:46:40
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789335137160464
2022-10-08 09:46:50
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789346020795773
2022-10-08 09:47:00
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789358229376258
2022-10-08 09:47:11
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789369214777896
2022-10-08 09:47:21
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789381355776256
2022-10-08 09:47:31
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789392476954784
2022-10-08 09:47:41
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789405206501824
2022-10-08 09:47:51
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789416473911420
2022-10-08 09:48:01
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789429186489733
2022-10-08 09:48:11
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4789440429492977

Brian Candler

Oct 8, 2022, 4:29:56 AM
to Prometheus Users
On Saturday, 8 October 2022 at 02:50:27 UTC+1 bett...@gmail.com wrote:
I'm still not 100% sure what you mean by "I use the web listening port 9116 of snmp_exporter".

That means I just view localhost:9116, which snmp_exporter is listening on.

Yes, but view using what? A web browser like Chrome or Firefox? curl? Something else?

What I'd like is for you to write a curl-based script with a loop and 10 second intervals:

while true; do
  curl -gsS '127.0.0.1:9116/snmp?module=XXXX&target=36.155.143.1' | egrep 'ifHCInOctets.*100GE1/0/3'
  sleep 10
done

How does the value change over time?

Then modify the loop so that it does both curl *and* snmpbulkwalk next to each other, followed by a 10 second delay. 

while true; do
  curl -gsS '127.0.0.1:9116/snmp?module=XXXX&target=36.155.143.1' | egrep 'ifHCInOctets.*100GE1/0/3'
  snmpbulkwalk ....XXXX.... 36.155.143.1
  sleep 10
done

How do the adjacent values compare?

Betts Wang

Oct 8, 2022, 10:25:42 AM
to Prometheus Users
Got it. I did the test as you said. You can see that the snmp_exporter value doesn't increase every 10s and is not equal to the snmpbulkwalk value taken at the same time. What's wrong with my snmp_exporter? And the value in Prometheus only increases once per minute, so my Grafana graph is abnormal.

-------------------------------------------------------------------------------------------------------------------------------------------------------------
root@prometheus:~# cat test.sh
#!/bin/bash

echo "" > result.txt

while true

do


time=$(date "+%Y-%m-%d %H:%M:%S")

echo $time >> result.txt


snmpbulkwalk -v3  -u cpucloud -l authpriv -a md5 -A ***** -x aes128 -X ***** 36.155.143.1 1.3.6.1.2.1.31.1.1.1.10.7 >> result.txt

curl -gsS 'http://10.0.1.150:9116/snmp?target=36.155.143.1&module=router_switch' | egrep 'ifHCOutOctets.*100GE1/0/3' >> result.txt

sleep 10

done
----------------------------------------------------------------------------------------------------------------------------------------------------------------------

root@prometheus:~# cat result.txt

2022-10-08 22:05:31
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4872937394413293
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.872915830877599e+15
2022-10-08 22:05:42
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4872961501783827
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.872915830877599e+15
2022-10-08 22:05:54
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4872992651951767
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.872915830877599e+15
2022-10-08 22:06:05
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873016362318930
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.872915830877599e+15
2022-10-08 22:06:16
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873042287477782
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873042287477782e+15
2022-10-08 22:06:28
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873070559473757
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873042287477782e+15
2022-10-08 22:06:39
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873096766134004
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873042287477782e+15
2022-10-08 22:06:50
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873122126211025
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873042287477782e+15
2022-10-08 22:07:01
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873148103100251
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873042287477782e+15
2022-10-08 22:07:13
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873175635898809
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873175635898809e+15
2022-10-08 22:07:25
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873205014904084
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873175635898809e+15
2022-10-08 22:07:36
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873230381476209
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873175635898809e+15
2022-10-08 22:07:48
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873257555050310
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873175635898809e+15
2022-10-08 22:07:59
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873284533886305
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873175635898809e+15
2022-10-08 22:08:10
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873310485423533
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873310485423533e+15
2022-10-08 22:08:22
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873338488602603
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873310485423533e+15
2022-10-08 22:08:34
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873367364172735
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873310485423533e+15
2022-10-08 22:08:45
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873392294080281
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873310485423533e+15
2022-10-08 22:08:56
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873418518834921
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873310485423533e+15
2022-10-08 22:09:07
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873444370397617
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873431788852479e+15
2022-10-08 22:09:18
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873467921056041
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873431788852479e+15
2022-10-08 22:09:30
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873495841349928
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873431788852479e+15
2022-10-08 22:09:41
iso.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4873523163115612
ifHCOutOctets{ifIndex="7",ifName="100GE1/0/3"} 4.873431788852479e+15

-------------------------------------------------------------------------------------------------------------------------------------

[attachment: 2022-10-08 221747.jpg (screenshot)]

Ben Kochie

Oct 8, 2022, 10:45:46 AM
to Betts Wang, Prometheus Users
This appears to be a device bug. For whatever reason, the device is responding to the exporter with cached results.

The snmp_exporter only presents data that is sent to it. So whatever your device is sending is incorrect.

Please contact your vendor.

Brian Candler

Oct 8, 2022, 12:22:29 PM
to Prometheus Users
Can you try adding -Cr25 to the snmpbulkwalk command? Or set "max_repetitions: 10" for snmp_exporter? The default for snmpbulkwalk is 10, but the default for snmp_exporter is 25. Devices have certainly been seen to show SNMP bugs when larger repetition counts are used (although in my experience, they normally just hang).
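
For reference, the snmpbulkwalk side of that would look something like this (a sketch, reusing the flags from your test.sh, with the repetition count raised to match snmp_exporter's default):

snmpbulkwalk -Cr25 -v3 -u cpucloud -l authpriv -a md5 -A ***** -x aes128 -X ***** 36.155.143.1 1.3.6.1.2.1.31.1.1.1.10.7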

If that's not it, then try issuing snmpbulkwalk commands for the same subtrees that snmp_exporter itself does, rather than just one OID.  That is: if you are using the default if_mib in snmp.yml, then do this:

while true; do
  snmpbulkwalk -On ..(authflags).. 36.155.143.1 .1.3.6.1.2.1.2 | wc -l
  snmpbulkwalk -On ..(authflags).. 36.155.143.1 .1.3.6.1.2.1.31.1.1 | egrep '^\.1\.3\.6\.1\.2\.1\.31\.1\.1\.1\.10\.7 ='
  sleep 10
done

Make sure you do both subtrees, to make it as similar as possible to the way snmp_exporter queries the MIBs.

My guess is, there's something about traversing the MIB in this way which is causing the device itself to freeze its exported counters for a minute.

If you can reproduce it this way, without snmp_exporter, then you can take it up with your vendor.

And if that's not it, there's some subtle difference in the way that snmp_exporter and snmpbulkwalk are querying the device (like the -Cr option I mentioned before).  tcpdump should be able to show this, if you can use authNoPriv instead of authPriv.

Betts Wang

Oct 9, 2022, 4:47:26 AM
to Prometheus Users
1. set "repetitions: 10" for snmp_exporter?  
Change as below, is it right? The issue is not fixed.

root@prometheus:~# tail -f /usr/local/snmp_exporter/snmp.yml
  max_repetitions: 10

  retries: 3
  timeout: 5s
  auth:
    security_level: authPriv
    username: cpucloud

2. I did the second test as you advised.
snmpbulkwalk also doesn't change the value sometimes. Why does this happen?

root@prometheus:~# cat test.sh
#!/bin/bash

while true
do
time=$(date "+%Y-%m-%d %H:%M:%S")
echo $time
snmpbulkwalk  -On -v3  -u cpucloud -l authpriv -a md5 -A ***** -x aes128 -X ***** 36.155.143.1 1.3.6.1.2.1.2 | wc -l
snmpbulkwalk -On -v3 -u cpucloud -l authpriv -a md5 -A ***** -x aes128 -X ***** 36.155.143.1 1.3.6.1.2.1.31.1.1 |  egrep '^\.1\.3\.6\.1\.2\.1\.31\.1\.1\.1\.10\.7 ='
sleep 10
done


root@prometheus:~# bash test.sh
2022-10-09 16:24:25
1431
.1.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4958930856500892
2022-10-09 16:24:36
1431
.1.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4959001927203249
2022-10-09 16:24:47
1431
.1.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4959001927203249
2022-10-09 16:24:58
1431
.1.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4959001927203249
2022-10-09 16:25:10
1431
.1.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4959001927203249
2022-10-09 16:25:21
1431
.1.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4959001927203249
2022-10-09 16:25:32
1431
.1.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4959078303903981
2022-10-09 16:25:43
1431
.1.3.6.1.2.1.31.1.1.1.10.7 = Counter64: 4959078303903981


3. tcpdump should be able to show this, if you can use authNoPriv instead of authPriv. 
I created a new SNMP user for troubleshooting and captured the packets of two snmpbulkwalk runs with tcpdump. See the attachment.
[attachment: MobaXterm_ControllerA11_20221009_164238.txt (tcpdump capture)]

Betts Wang

Oct 9, 2022, 5:30:21 AM
to Prometheus Users
By the way, what confuses me is why a few ports of the same device look normal. Other devices (routers and switches) also have this issue, not just one device, but they are all from the same vendor.

Brian Candler

Oct 10, 2022, 3:02:29 AM
to Prometheus Users
> snmpbulkwalk also doesn't change the value sometimes. Why does this happen?

That's excellent.  You have now:
1. isolated the issue;
2. proved that it is nothing to do with snmp_exporter; and
3. obtained a reliable way to reproduce it using basic command line tools.

You can now take this up with your vendor.

It looks like when you walk the OID tree this way, the device is freezing all the counters for 50-60 seconds.  Why? It might be rate-limiting its OID updates as an "anti-DoS" measure; or its CPU may be so overloaded it doesn't have time to update the SNMP data; or it could just be an outright bug.  Only your vendor can explain this to you.

Until your vendor provides a resolution, I suggest you set your polling interval to 60 seconds instead of 10 seconds.
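
In the 'outside-device' job from your prometheus.yml, that is just (only the relevant lines shown):

- job_name: 'outside-device'
  scrape_interval: 60s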

Betts Wang

Oct 11, 2022, 10:05:59 PM
to Prometheus Users
Hi, Brian

I'm glad to tell you this issue is solved. Thanks for your kind assistance.
You are right: the vendor told me that the device's IF-MIB has a sampling period.

For Huawei S series and CE series network devices, we can apply either of the following changes:

system-view
set if-mib sample-interval 10

or

system-view
snmp-agent get-next-cache age 0

Brian Candler

Oct 12, 2022, 4:30:26 AM
to Prometheus Users
Excellent, and thank you for coming back with the exact solution.