Random jumps in network activity


Adam

Sep 22, 2020, 12:03:00 PM
to Prometheus Users
I am trying to troubleshoot the cause of random jumps in the network activity. I have a custom exporter that reports the current network activity from WAN routers via SNMP. The data is scraped every two minutes and this issue only happens on specific routers while others report no issues.

Here's a picture to describe the issue I'm facing:
Screenshot 2020-09-22 115806.png
The metric data is the same on Prometheus' end; I'm just using Grafana to visualize the issue.

As you can tell, raw_172.19.187.146 (yellow line) increases smoothly, while the other series (green line) shows a massive jump for at least 2-3 minutes.

Any idea why this is happening? Is the query in row A acceptable for showing network activity over five-minute windows (since Prometheus scrapes every two minutes)?

Thank you and I look forward to your response.

Ben Kochie

Sep 22, 2020, 12:30:00 PM
to Adam, Prometheus Users
What if you graph `resets(wan_ifInOctets{site=~"$site"}[5m])`?


Brian Candler

Sep 22, 2020, 1:00:04 PM
to Prometheus Users
It's odd that the yellow graph appears to be smoothly increasing at 2 minute intervals, whilst the green one has a rate burst.  A value of 200Gbps implies that the counter has gone up by 7.5TB between the start and end of the 5 minute rate window.  And interestingly, the absolute value of the counter shown is 6TB (right-hand axis), which is in the same ballpark.

Is it possible that the counter value goes 6TB -> 0 -> 6TB very quickly, with the points so closely spaced that the zero value isn't picked up in the yellow graph? Or is Grafana ignoring zero/null values?

It's possible to check this using the PromQL browser in the Prometheus web interface (normally port 9090), or the API, and extract the raw data from the TSDB.  Do an instant query for

wan_ifInOctets{site="foo"}[10m]

and you'll get all the raw data points over that period.  Adjust the time of the query so that it covers the period where things are strange, and look for values which look out of place.
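The same raw samples can also be pulled through the HTTP API. A minimal sketch of building the query URL (the host, the default port 9090, and the `site="foo"` matcher are assumptions; add a `time` parameter to aim the query at the strange period):

```python
from urllib.parse import urlencode

# Build an instant-query URL for the raw samples over the last 10 minutes.
# localhost:9090 and site="foo" are placeholders, not values from the thread.
params = urlencode({"query": 'wan_ifInOctets{site="foo"}[10m]'})
url = "http://localhost:9090/api/v1/query?" + params
print(url)
```

Fetching that URL returns the individual samples with their timestamps, which is exactly what you want to inspect here.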

My other question is, how is "wan_ifInOctets" being collected in the first place?  The standard snmp_exporter if_mib would give "ifInOctets" (plus labels for the interfaces).  Have you got a recording rule or something like that?

The standard ifInOctets is a 32-bit counter, with maximum value ~4GB.  I was going to suggest you use ifHCInOctets, but then I see your "wan_ifInOctets" already has a value of ~6TB so it can't possibly be 32 bits.  It would be good to understand where it comes from.

Brian Candler

Sep 22, 2020, 1:17:50 PM
to Prometheus Users
Also, what's your prometheus version?

Adam

Sep 22, 2020, 1:18:44 PM
to Prometheus Users
Here's the result of `resets(wan_ifInOctets{site=~"$site"}[5m])`:
 Screenshot 2020-09-22 131755.png

Adam

Sep 22, 2020, 1:30:35 PM
to Prometheus Users
`wan_ifInOctets{site="foo"}[10m]` results (timestamp is set to where the value goes off):
Screenshot 2020-09-22 132339.png
I changed the timestamp around and did not see anything going way off, though I may have missed it.

The wan_ifInOctets is coming from a custom exporter, wan_exporter, and it is very similar to snmp_exporter's ifInOctets. The only difference is how the exporter is called to retrieve the data. Here's an example of the output of wan_exporter if I put in a specific site:
Screenshot 2020-09-22 132904.png

Prometheus version:  v2.20.1

Adam

Sep 22, 2020, 1:32:55 PM
to Prometheus Users
Sorry, let me clarify: wan_ifInOctets uses ifHCInOctets's SNMP OID, which is 1.3.6.1.2.1.31.1.1.1.6.


Ben Kochie

Sep 22, 2020, 2:24:39 PM
to Adam, Prometheus Users
This is what I suspected. The custom exporter is producing values that occasionally drop slightly from scrape to scrape, tricking Prometheus into thinking the counter was reset.

If you did a deriv() function, you would see a negative traffic value.
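Prometheus's reset handling can be sketched roughly like this (a simplification for illustration, not the actual `rate()`/`increase()` implementation): whenever a sample is lower than the previous one, the pre-"reset" value is added back, so a tiny dip in a multi-terabyte counter inflates the computed increase by the whole counter value.

```python
def increase_with_reset_correction(samples):
    """Simplified sketch of Prometheus counter-reset handling:
    any decrease between consecutive samples is treated as a reset,
    and the pre-reset value is added back to the total increase."""
    total = samples[-1] - samples[0]
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:      # apparent counter reset
            total += prev   # add back the pre-"reset" counter value
    return total

# A ~6 TB counter that dips slightly due to interleaved scrapes:
tb = 10**12
clean = [6.000 * tb, 6.001 * tb, 6.002 * tb]
dipped = [6.000 * tb, 5.999 * tb, 6.002 * tb]  # one sample slightly lower

print(increase_with_reset_correction(clean))   # small, real increase (~2e9)
print(increase_with_reset_correction(dipped))  # inflated by ~6 TB
```

That inflated increase, spread over a 5-minute rate window, is exactly the kind of 200Gbps spike shown in the graph.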

Adam

Sep 22, 2020, 2:59:44 PM
to Prometheus Users
Ah, interesting. I did a quick test with the `deriv()` function on wan_ifInOctets and that seems to have fixed the issue. However, the Prometheus docs say it should be used with gauge metrics rather than counters. Are there any drawbacks to using `deriv()` over `rate()`?

Ben Kochie

Sep 22, 2020, 3:36:49 PM
to Adam, Prometheus Users
Yes: deriv() doesn't handle real counter resets and produces negative values.

It's not fixed, you're just ignoring the fact that it's almost certainly producing negative rates.
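`deriv()` fits a least-squares line through the samples, so a genuine counter reset drags the slope negative instead of being corrected. A rough sketch (the timestamps and values are made up for illustration):

```python
def deriv_slope(ts, vs):
    """Least-squares slope over (timestamp, value) samples,
    roughly what PromQL deriv() computes (a sketch)."""
    n = len(ts)
    mt, mv = sum(ts) / n, sum(vs) / n
    num = sum((t - mt) * (v - mv) for t, v in zip(ts, vs))
    den = sum((t - mt) ** 2 for t in ts)
    return num / den

ts = [0, 120, 240]
vs = [1000, 2000, 50]        # genuine counter reset at the last sample
print(deriv_slope(ts, vs))   # negative: deriv() ignores the reset
```

With `rate()` the reset would be corrected and the result stays non-negative; with `deriv()` the graph just goes below zero.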

Something is funky about your timestamps. It looks like two different scrape jobs are being mixed into a single series.

Brian Candler

Sep 23, 2020, 3:57:12 AM
to Prometheus Users
As Ben says: it seems that data points are being received in the wrong order, or multiple scrapes are being intermixed, such that the counter sometimes goes down.

It would have been easier if you'd done a copy-paste rather than pasting a screen image, but I can see the timestamps going like this:

...5372
...5376
...5492
...5496
...5612
...5616

Notice the intervals of 4 seconds, 116 seconds, 4 seconds, 116 seconds etc.
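Treating those digits as (hypothetical) epoch-second suffixes, the pattern is easy to check: every second sample is exactly 120 seconds apart, i.e. two interleaved 120-second scrape loops writing into one series.

```python
# Last digits of the sample timestamps from the screenshot (hypothetical
# epoch-second suffixes; only the differences matter):
stamps = [5372, 5376, 5492, 5496, 5612, 5616]

gaps = [b - a for a, b in zip(stamps, stamps[1:])]
print(gaps)  # [4, 116, 4, 116, 4]

# Every second sample is exactly 120 s apart -> two interleaved
# 120-second scrape loops landing in the same series:
print([stamps[i + 2] - stamps[i] for i in range(len(stamps) - 2)])
```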

It looks like you're scraping the same target twice, with both scrapes at 120 second intervals, but merging the results into the same timeseries.  This is clearly broken.  I suspect that either:

1. you've dropped some labels in metric relabelling that shouldn't have been dropped
2. you've changed some labels which should be distinct (e.g. you've overwritten the "job" label in relabelling, so that two different scrape jobs get the same "job" label)
3. you're federating the same timeseries from two different prometheus servers, but forgot to set external_labels to keep the two timeseries distinct
4. you've listed the same target more than once in a targets file (but without applying additional labels to distinguish the scrapes)
5. misuse of "honor_labels: true"
6. something else I haven't thought of

You'll need to check your scrape configs to work out how multiple scrapes for the same target could end up in the same timeseries (i.e. with exactly the same label set).
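For case 3 in particular, distinct external_labels on each federated server keep the two copies of a series apart. A hedged sketch (the label name and values here are illustrative, not from the thread):

```yaml
# prometheus.yml on each source server being federated
# (use a different value, e.g. prom-b, on the other server):
global:
  external_labels:
    replica: prom-a
```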

Adam

Sep 23, 2020, 9:07:04 AM
to Prometheus Users
There's a very good chance it's the federation configuration that my noob self set up a while back. After researching federation, I realized I didn't need it at all, so I removed the whole federation configuration and reverted the recording rule from deriv() back to rate().

Hopefully, that should fix the issue.
