Weird node_exporter network metrics behaviour - NIC problem?


Dito Windyaksa

Jan 14, 2024, 6:02:26 PM
to Prometheus Users
Hi,

We're migrating to a new bare-metal provider and noticed that the network metrics don't add up.

We ran an iperf test between A and B, and noticed "drops" on the new machine's graph while the test was ongoing.

We did not see any corresponding bandwidth drops on either the iperf server or client side.

[screenshot: throughput graph for both machines during the iperf test, showing drops on the new one]

Both are graphed using the same query:
irate(node_network_receive_bytes_total{instance="xxx", device="eno1"}[1m])*8

One thing is certain: the green-line machine is running an Intel 10G NIC, while the blue-line machine is running a Broadcom 10G NIC.
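
For reference, the driver and version behind each NIC can be checked with ethtool (assuming the interface name from the query, eno1):

ethtool -i eno1

which prints the driver name, version, and firmware.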

Any ideas?
Dito

Alexander Wilke

Jan 14, 2024, 6:49:35 PM
to Prometheus Users
Do you have the same scrape_interval for both machines?
Are you running irate() on both queries, or rate() on one and irate() on the other?
Are the iperf intervals the same for both tests?

Dito Windyaksa

Jan 14, 2024, 8:02:59 PM
to Prometheus Users
Yup - both are scraped at the same interval (15s) and use the same irate query:
irate(node_network_transmit_bytes_total{instance="xxx:9100", device="eno1"}[1m])*8

The iperf test runs directly between the two machines, and no interval argument is set (the default of zero).
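
Roughly, the setup is a plain iperf2 pair (a sketch - the host name and duration here are placeholders):

iperf -s                     # on machine A
iperf -c <machine-A> -t 600  # on machine B

With no -i flag, iperf2 only prints the final summary.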

I wonder if it has something to do with how Broadcom reports network stats to /proc/net/dev?

Bryan Boreham

Jan 15, 2024, 6:24:46 AM
to Prometheus Users
I would recommend you stop using irate().
irate() only looks at the last two samples in the range, so with 4 samples per minute, irate(...[1m]) discards half your information. This can lead to artefacts.
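
For comparison, the rate() form of the same query - same selector, just rate() instead of irate() - averages over the whole window rather than the last pair of samples:

rate(node_network_transmit_bytes_total{instance="xxx:9100", device="eno1"}[1m])*8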

There is probably some instability in the underlying samples, which is worth investigating. 
An instant query like node_network_transmit_bytes_total{instance="xxx:9100", device="eno1"}[10m] will return the raw counter samples, with no rate calculation applied.
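
You can also run that against the HTTP API directly - a sketch assuming Prometheus is listening on localhost:9090:

curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=node_network_transmit_bytes_total{instance="xxx:9100", device="eno1"}[10m]'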

Dito Windyaksa

Jan 16, 2024, 8:55:06 AM
to Prometheus Users
You're right - it's related to our irate query. We tried switching to rate() and it gives us a straight line during iperf tests.

We've been using irate for years across dozens of servers, but we've only noticed these "weird drops" / unstable samples on this single server.

We don't see any drops during iperf tests when using the irate query on the other servers.

Any clues why? NIC related?


Brian Candler

Jan 16, 2024, 9:20:22 AM
to Prometheus Users
I would suspect it's due to how the counters are incremented and how the new values are published.

Suppose the NIC's driver publishes new counter values at some odd interval, like every 0.9 seconds. Your 15-second scrape will then sometimes see the result of 16 increments since the previous counter reading, and sometimes 17.

It's just a guess, but it's the sort of thing that can cause such artefacts.
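
To put rough numbers on the guess: 15 s / 0.9 s ≈ 16.7 updates per scrape interval, so successive scrapes alternate between seeing 16 and 17 updates' worth of traffic. One update is 0.9 s of traffic, about 6% of a 15-second delta - enough to show up as periodic dips in an irate() graph.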

Dito Windyaksa

Jan 16, 2024, 9:44:50 AM
to Prometheus Users
Sounds like it. Spamming "cat /proc/net/dev" (definitely not the scientific way) showed a visible delay in the network stats updates:

Broadcom NIC
[screenshot: /proc/net/dev byte counters updating with a visible delay]

Intel NIC (almost instantaneously)
[screenshot: /proc/net/dev byte counters updating in real time]
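
A slightly more systematic version of the same check, assuming the interface is eno1:

watch -n 0.1 'grep eno1 /proc/net/dev'

On the Broadcom box the counters visibly lag; on the Intel box they move on nearly every refresh.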

And Broadcom hasn't updated its Linux driver since 2014, I guess:

kernel: [    9.286981] bnx2x: QLogic 5771x/578xx 10/20-Gigabit Ethernet Driver bnx2x 1.713.36-0 (2014/02/10)