Prometheus and Bad Sample Management


Peter Zaitsev

Dec 30, 2015, 9:10:26 PM
to Prometheus Developers
Hi,

I have been testing Prometheus with two nodes with a wireless connection between them, and as the network went bad for some time, I found bad samples came in showing disk I/O bandwidth spiking to multiple GB/sec, along with a whole bunch of other values outside of what you would expect.

How does Prometheus deal with very long sampling?

For example, what if a request to a remote node_exporter took 60 seconds to complete? Let's assume the stall was on the network side, which means the actual snapshot can correspond to any time within those 60 seconds. How does Prometheus know where?

Note that even if you put the timestamp in the metrics exporter, it does not remove the problem, as there are cases when MySQL can stall processing SHOW STATUS for a large number of seconds.

One way we dealt with this issue in another project was to validate the samples: if something is selected to be scraped every 5 seconds, the sample is only accepted if it took 1 second or less, as we know in that case it will not be more than 20% off.

Is there a similar feature in Prometheus I can enable?

Generally, I think missed samples because of network issues, overload, etc. are much better than misleading information.


Brian Brazil

Dec 31, 2015, 4:59:34 AM
to Peter Zaitsev, Prometheus Developers
On Thu, Dec 31, 2015 at 2:10 AM, Peter Zaitsev <p...@percona.com> wrote:
Hi,

I have been testing Prometheus with two nodes with a wireless connection between them, and as the network went bad for some time, I found bad samples came in showing disk I/O bandwidth spiking to multiple GB/sec, along with a whole bunch of other values outside of what you would expect.

How does Prometheus deal with very long sampling?

It doesn't treat it any differently. It sounds like you might have run into rate() not dealing well with appearing and disappearing timeseries, which we're in the process of improving.

We generally recommend having Prometheus on the same network as what you're monitoring, as it helps avoid this sort of problem.
 

For example, what if a request to a remote node_exporter took 60 seconds to complete?

The default timeout is 10s, so by default it'd be a failed scrape.
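For reference, a minimal sketch of how those two settings fit together in prometheus.yml (modern syntax; the job name and target below are placeholders, not from this thread):

 scrape_configs:
   - job_name: 'node'                       # hypothetical job name
     scrape_interval: 5s                    # how often to scrape
     scrape_timeout: 5s                     # scrapes taking longer count as failed; defaults to 10s
     static_configs:
       - targets: ['db1.example.com:9100']  # placeholder node_exporter target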
 
Let's assume the stall was on the network side, which means the actual snapshot can correspond to any time within those 60 seconds. How does Prometheus know where?

It doesn't know, but it'll always choose the same time (start of the scrape iirc). This is on the presumption that the delay is constant, which it usually is going to be. 

Note that even if you put the timestamp in the metrics exporter, it does not remove the problem, as there are cases when MySQL can stall processing SHOW STATUS for a large number of seconds.

One way we dealt with this issue in another project was to validate the samples: if something is selected to be scraped every 5 seconds, the sample is only accepted if it took 1 second or less, as we know in that case it will not be more than 20% off.

Is there a similar feature in Prometheus I can enable?

Generally, I think missed samples because of network issues, overload, etc. are much better than misleading information.

It sounds like you're looking for timeouts.

Brian 






Björn Rabenstein

Jan 1, 2016, 5:17:53 PM
to Brian Brazil, Peter Zaitsev, Prometheus Developers
Independently of what's discussed here, I think we should clearly
document which timestamp Prometheus will attach for a scrape
(beginning or end). The case where a monitored target knows exactly
the timestamp of the value (and cares about it) might be a good
use-case for an explicitly provided timestamp. We wouldn't run into
staleness problems as long as the timestamp is close to the time of
the scrape.

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

Peter Zaitsev

Jan 1, 2016, 5:31:27 PM
to Brian Brazil, Prometheus Developers
Brian,

The network is just one example. Looking specifically at MySQL, at times of high load it can take a long time to respond, producing bizarre spikes in the data. For pretty much everything, really, we're not dealing with hard realtime systems, so a target can respond with significant delay with some probability. That's not a big deal if you're just interested in averages, but if you want to "zoom in" at 1-second resolution, it becomes hard to see what is noise and what is data.

I do not really mind whether it is implemented as a timeout or as an additional feature to throw away samples that took too long.

You mentioned a timeout is available. Can I set it to 100ms for a 1s capture, so invalid captures can be ignored?

--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev



Peter Zaitsev

Jan 1, 2016, 5:35:36 PM
to Brian Brazil, Prometheus Developers
Hm,

It looks like you're speaking about scrape_timeout:

 scrape_timeout:    100ms


It looks like specifying it as either 100ms or 0.1s does not work.

Is there a good reason to have a 1-second minimum resolution for the timeout?


Brian Brazil

Jan 1, 2016, 5:47:25 PM
to Peter Zaitsev, Prometheus Developers
On Fri, Jan 1, 2016 at 10:31 PM, Peter Zaitsev <p...@percona.com> wrote:
Brian,

The network is just one example. Looking specifically at MySQL, at times of high load it can take a long time to respond, producing bizarre spikes in the data. For pretty much everything, really, we're not dealing with hard realtime systems, so a target can respond with significant delay with some probability. That's not a big deal if you're just interested in averages, but if you want to "zoom in" at 1-second resolution, it becomes hard to see what is noise and what is data.

There is no general solution to this problem; if the system you're trying to monitor is overloaded, then all bets are off. For high-granularity scrapes (anything under 10s) I'd recommend keeping them to the bare minimum of metrics and ensuring the code paths involved are very fast.

Brian




Peter Zaitsev

Jan 1, 2016, 6:03:50 PM
to Brian Brazil, Prometheus Developers
I understand that,

I think we're speaking about different things here. If some information _usually_ takes 10 seconds to sample, of course there is no point in sampling it at a 1-second rate.

However, if you look at MySQL and a simple SHOW STATUS sample, it is very feasible at 1-second resolution and is often very helpful for problem diagnostics.

This, however, is exactly when it is important to know what is real data and what is the noise of an invalid capture. For example, I might well correlate application timeouts to a query spike where 50,000 QPS is reported instead of the normal 5,000 QPS. This may well be reality, due to some stalls at the application level for example, or it may be some other problem that caused SHOW STATUS to be slow to return, giving us wrong data and sending us on a wild goose chase.

If there is no solution right now, I just wanted to make sure the problem is understood, and why it is important to me.



Brian Brazil

Jan 1, 2016, 6:31:15 PM
to Peter Zaitsev, Prometheus Developers
On Fri, Jan 1, 2016 at 11:03 PM, Peter Zaitsev <p...@percona.com> wrote:
I understand that,

I think we're speaking about different things here. If some information _usually_ takes 10 seconds to sample, of course there is no point in sampling it at a 1-second rate.

However, if you look at MySQL and a simple SHOW STATUS sample, it is very feasible at 1-second resolution and is often very helpful for problem diagnostics.

This, however, is exactly when it is important to know what is real data and what is the noise of an invalid capture. For example, I might well correlate application timeouts to a query spike where 50,000 QPS is reported instead of the normal 5,000 QPS. This may well be reality, due to some stalls at the application level for example, or it may be some other problem that caused SHOW STATUS to be slow to return, giving us wrong data and sending us on a wild goose chase.

If there is no solution right now, I just wanted to make sure the problem is understood, and why it is important to me.

I understand where you're coming from. I don't think there's any good solution at the Prometheus or exporter level. The true solution is to make the application not overloaded, which obviously has its own challenges. This is the reason we have scrape_duration_seconds: to help detect overloaded systems when the system itself can't be trusted.

Throwing away data will cause artifacts with irate. Messing around with timestamps only moves the problem around, exposing you to slightly different races. Both of these are likely to cause more problems than they solve.

At a 1s scrape there should be a 1s timeout. At worst you should see a spike 2x the true value. Using rate() rather than irate() will smooth such things out a bit too.
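As a sketch of the difference in PromQL, using the node_exporter counter that appears later in this thread:

 rate(node_disk_sectors_read[5m])    # average per-second increase over the whole 5m window; smooths spikes out
 irate(node_disk_sectors_read[5m])   # per-second increase between the last two samples in the window; keeps spikes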

Brian 




Peter Zaitsev

Jan 1, 2016, 7:17:33 PM
to Brian Brazil, Prometheus Developers

I understand where you're coming from. I don't think there's any good solution at the Prometheus or exporter level. The true solution is to make the application not overloaded, which obviously has its own challenges. This is the reason we have scrape_duration_seconds: to help detect overloaded systems when the system itself can't be trusted.

Throwing away data will cause artifacts with irate. Messing around with timestamps only moves the problem around, exposing you to slightly different races. Both of these are likely to cause more problems than they solve.

At a 1s scrape there should be a 1s timeout. At worst you should see a spike 2x the true value. Using rate() rather than irate() will smooth such things out a bit too.


You're right that a 1s scrape with a 1s timeout will not cause things to be off by more than double. I would like the data to be more trustworthy, though, having it be no more than 20% off from reality. Too bad that currently can't be configured.

I do not think messing with the timestamp is just moving the problem around. Reporting 50,000 queries in the given time obviously misleads any further processing of the data that may be going on. If, however, you simply do not have any data for a given period of time, that is fine, as such systems are typically not designed to expect absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e. if I have a query counter at 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, then irate() will see a change of 50K over a 10-second period and report a rate of 5,000/sec, exactly what I'm looking for. Of course I will not have any detail of in which of those 10 seconds the queries came, but that is OK.
 

Brian Brazil

Jan 1, 2016, 7:51:41 PM
to Peter Zaitsev, Prometheus Developers
On Sat, Jan 2, 2016 at 12:17 AM, Peter Zaitsev <p...@percona.com> wrote:

I understand where you're coming from. I don't think there's any good solution at the Prometheus or exporter level. The true solution is to make the application not overloaded, which obviously has its own challenges. This is the reason we have scrape_duration_seconds: to help detect overloaded systems when the system itself can't be trusted.

Throwing away data will cause artifacts with irate. Messing around with timestamps only moves the problem around, exposing you to slightly different races. Both of these are likely to cause more problems than they solve.

At a 1s scrape there should be a 1s timeout. At worst you should see a spike 2x the true value. Using rate() rather than irate() will smooth such things out a bit too.


You're right that a 1s scrape with a 1s timeout will not cause things to be off by more than double. I would like the data to be more trustworthy, though, having it be no more than 20% off from reality. Too bad that currently can't be configured.

I do not think messing with the timestamp is just moving the problem around.

Consider what happens if a long pause occurs in the middle of SHOW STATUS. No matter what you do, at least half the values will have the wrong timestamp. This is an unavoidable race condition. There's also the problem of distributed time synchronisation.

Reporting 50,000 queries in the given time obviously misleads any further processing of the data that may be going on. If, however, you simply do not have any data for a given period of time, that is fine, as such systems are typically not designed to expect absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e. if I have a query counter at 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, then irate() will see a change of 50K over a 10-second period and report a rate of 5,000/sec, exactly what I'm looking for. Of course I will not have any detail of in which of those 10 seconds the queries came, but that is OK.

It'll continue to report 5,000/sec until it gets another value, and in the gap it will report the old value. It reports the rate looking back in time, not the rate at a given time.

Brian
 
 


--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev







Peter Zaitsev

Jan 1, 2016, 8:11:59 PM
to Brian Brazil, Prometheus Developers
I do not think messing with the timestamp is just moving the problem around.

Consider what happens if a long pause occurs in the middle of SHOW STATUS. No matter what you do, at least half the values will have the wrong timestamp. This is an unavoidable race condition. There's also the problem of distributed time synchronisation.

I'm confused: where do you see the time synchronization problem here? As I understand it, exporters do not even report the time. They report certain variable values, and Prometheus assigns the timestamp based on when it got the data. What do you see as being synchronized here?

What should happen if SHOW STATUS stalls in the middle? In that case, in my opinion, we do not get a complete sample and it should be discarded.



 
 
Reporting 50,000 queries in the given time obviously misleads any further processing of the data that may be going on. If, however, you simply do not have any data for a given period of time, that is fine, as such systems are typically not designed to expect absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e. if I have a query counter at 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, then irate() will see a change of 50K over a 10-second period and report a rate of 5,000/sec, exactly what I'm looking for. Of course I will not have any detail of in which of those 10 seconds the queries came, but that is OK.

It'll continue to report 5,000/sec until it gets another value, and in the gap it will report the old value. It reports the rate looking back in time, not the rate at a given time.


Are you sure about that? In my experiments, if I make some samples unavailable (i.e. by pausing the exporter for a while), irate() seems to deal with it quite well.

Here is an example graph from Grafana. I stopped the node exporter on this node so some samples were not available, yet it looks like the average bandwidth for this time was computed properly:

[Inline image: Grafana graph of disk read bandwidth across the exporter outage]

This is the query which was used:

irate(node_disk_sectors_read{alias="$host",device!~"dm-*"}[5m])*512

Brian Brazil

Jan 2, 2016, 12:10:31 PM
to Peter Zaitsev, Prometheus Developers
On Sat, Jan 2, 2016 at 1:11 AM, Peter Zaitsev <p...@percona.com> wrote:
I do not think messing with the timestamp is just moving the problem around.

Consider what happens if a long pause occurs in the middle of SHOW STATUS. No matter what you do, at least half the values will have the wrong timestamp. This is an unavoidable race condition. There's also the problem of distributed time synchronisation.

I'm confused: where do you see the time synchronization problem here? As I understand it, exporters do not even report the time. They report certain variable values, and Prometheus assigns the timestamp based on when it got the data. What do you see as being synchronized here?

That's correct. You had mentioned having exporters send the time, so I was pointing out the challenges with that.

What should happen if SHOW STATUS stalls in the middle? In that case, in my opinion, we do not get a complete sample and it should be discarded.



 
 
Reporting 50,000 queries in the given time obviously misleads any further processing of the data that may be going on. If, however, you simply do not have any data for a given period of time, that is fine, as such systems are typically not designed to expect absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e. if I have a query counter at 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, then irate() will see a change of 50K over a 10-second period and report a rate of 5,000/sec, exactly what I'm looking for. Of course I will not have any detail of in which of those 10 seconds the queries came, but that is OK.

It'll continue to report 5,000/sec until it gets another value, and in the gap it will report the old value. It reports the rate looking back in time, not the rate at a given time.

Are you sure about that? In my experiments, if I make some samples unavailable (i.e. by pausing the exporter for a while), irate() seems to deal with it quite well.

Here is an example graph from Grafana. I stopped the node exporter on this node so some samples were not available, yet it looks like the average bandwidth for this time was computed properly:

[Inline image: Grafana graph of disk read bandwidth across the exporter outage]

This is the query which was used:

irate(node_disk_sectors_read{alias="$host",device!~"dm-*"}[5m])*512

It'll look okay, but it doesn't mean what you think it means. That long flat period's true value is actually a bit lower, as the next point after it is lower.


Peter Zaitsev

Jan 2, 2016, 12:28:01 PM
to Brian Brazil, Prometheus Developers
Brian,

Can you please be more specific, in terms of math or code, about what you mean when you say it does not mean what I think it means?

I have worked with enough monitoring tools in the industry to have an understanding of how the rate is computed for sparse time series. It is done the following way:

If we have 2 samples with values N1 and N2 corresponding to times T1 and T2:

T1   N1
T2   N2

Instant Rate = (N2 - N1) / (T2 - T1)

Does Prometheus compute the rate in some other way? How?

If this is how it does it, then when losing samples, i.e. if instead of T2 = T1 + 1 sec we have data available only at T3 = T2 + 5 min, we would still get the correct average rate of increase in the N value; it will just be an average over 5 minutes rather than 1 second.



The lower data points in this graph have nothing to do with the period when the exporter was down; it is simply a volatile time series which changes a lot, and as you can see it had dips before and after.

Brian Brazil

Jan 2, 2016, 12:35:25 PM
to Peter Zaitsev, Prometheus Developers
On Sat, Jan 2, 2016 at 5:28 PM, Peter Zaitsev <p...@percona.com> wrote:
Brian,

Can you please be more specific, in terms of math or code, about what you mean when you say it does not mean what I think it means?

I have worked with enough monitoring tools in the industry to have an understanding of how the rate is computed for sparse time series. It is done the following way:

If we have 2 samples with values N1 and N2 corresponding to times T1 and T2:

T1   N1
T2   N2

Instant Rate = (N2 - N1) / (T2 - T1)

Does Prometheus compute the rate in some other way? How?

That is how it's calculated, but that's not the full computational model. Between T1 and T2, the T0->T1 rate is returned. Between T2 and T3, the T1->T2 rate is returned. Computations in Prometheus are always done looking backwards, so the values will be correct but not necessarily with the timestamp you expect.

There's no workaround for irate(), but for rate() what you can do is request a timestamp half the period of the rate forwards in time to correct for this. For example, for a 5m rate you'd ask for a time 2.5m ahead.
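Spelling out why half the period is the right shift (a plain restatement of the above, not a Prometheus feature): a rate over the window [t - w, t] computes

 (f(t) - f(t - w)) / w

which is the best estimate of the derivative at the window midpoint t - w/2, not at t. So the value returned for time t is really centered on t - w/2; asking for t + w/2 and plotting the result at t corrects for that.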

Brian




Peter Zaitsev

Jan 2, 2016, 12:37:47 PM
to Brian Brazil, Prometheus Developers
Brian,

Thinking a little bit more about that....

What you seem to be saying is that Prometheus does not map data to intervals the way I would think, not that irate() computes it differently?

Let's look at the following times and values:

T      V

1    - 100
2    - 200
10   - 1800

as the available samples, with the 8 seconds in between skipped.

If we look at the mathematically correct answer to the question of what the rate was at T=7, we can see it belongs to the interval 2..10, where the average rate was 200/sec. However, if I understand what you're saying correctly, Prometheus will report the last computed value (based on the 1..2 interval rate) until T=10, so the reported rate for T=7 will be 100, not 200 as would be correct in this case, and as such a graph plotted at one-second resolution would have a straight line at "100" until the value 200 is reported at T=10.

Is this correct?




Brian Brazil

Jan 2, 2016, 12:43:30 PM
to Peter Zaitsev, Prometheus Developers
This is correct. At T=7 we don't have the T=10 value yet, so the best we can do is return 100/s.  We also don't want the values at a given point to change based on future points, as that'd make computations non-repeatable and hard to debug.
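Tracing the evaluation with those samples (assuming a range long enough to cover both points) makes this concrete:

 samples:                     t=1 → 100    t=2 → 200    t=10 → 1800

 irate at any t in [2, 10):   (200 - 100) / (2 - 1)   = 100/s   (last two samples available so far)
 irate at t >= 10:            (1800 - 200) / (10 - 2) = 200/s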


Peter Zaitsev

Jan 2, 2016, 12:48:00 PM
to Brian Brazil, Prometheus Developers
Brian,

Thanks, now it makes sense. This is a bummer... it makes irate() pretty useless for sparse series.

This also makes dashboards very hard to build when the sample rate is unknown.

One semi-solution here is to pass a very short interval to irate() so it can at least report no data available instead of showing the last known increase when samples are missing, as sketched below.
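A sketch of that semi-solution, assuming a 1s scrape interval and a mysqld_exporter-style counter (the metric name here is illustrative):

 irate(mysql_global_status_queries[3s])

With the range barely over two scrape intervals, the window normally contains the two most recent samples; once the window no longer contains two points the expression returns no data instead of repeating the last known rate.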

I wish we had some function like firate() or something similar which reports the rate of the interval to which the data point belongs, not the previous interval as it is now.



Peter Zaitsev

Jan 2, 2016, 12:48:04 PM
to Brian Brazil, Prometheus Developers
Brian,

For irate() there is actually a very simple solution to this problem: as it only needs to look one data point ahead, it is not a big deal to say that the value for T is not available until a data point with T1 > T exists. It delays what you can plot on graphs, generally by one sample interval, but to my taste makes them much better by not reporting any stale data.

If one wanted to implement such a function in Prometheus in addition to irate(), how would one go about it?

Brian Brazil

Jan 2, 2016, 12:59:39 PM
to Peter Zaitsev, Prometheus Developers
On Sat, Jan 2, 2016 at 5:48 PM, Peter Zaitsev <p...@percona.com> wrote:
Brian,

For irate() there is actually a very simple solution to this problem: as it only needs to look one data point ahead, it is not a big deal to say that the value for T is not available until a data point with T1 > T exists. It delays what you can plot on graphs, generally by one sample interval, but to my taste makes them much better by not reporting any stale data.

If one wanted to implement such a function in Prometheus in addition to irate(), how would one go about it?

Such a function doesn't really fit the computation model, as it goes against how we think about range vectors and breaks the invariant that values don't change as newer values come in.

There's nothing stopping you from doing this at the graphing layer with rate(), though, by increasing the timestamps you're requesting by half the period.
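A sketch of what that looks like against the HTTP API, under the assumption that the dashboard does the shifting itself: to draw a point at wall-clock time t using a 5m rate, evaluate

 /api/v1/query?query=rate(node_disk_sectors_read[5m])&time=<t + 150s>

and plot the returned value at t, so each rate is attributed to the midpoint of its window.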

Brian
 

On Sat, Jan 2, 2016 at 12:43 PM, Brian Brazil <brian....@robustperception.io> wrote:
On Sat, Jan 2, 2016 at 5:37 PM, Peter Zaitsev <p...@percona.com> wrote:
Brian,

Thinking a little bit more about that....

What you seem to be saying is that Prometheus does not map data to intervals the way I would think, not that irate() computes it differently?

Let's look at the following times and values:

T      V

1    - 100
2    - 200
10   - 1800

as the available samples, with the 8 seconds in between skipped.

If we look at the mathematically correct answer to the question of what the rate was at T=7, we can see it belongs to the interval 2..10, where the average rate was 200/sec. However, if I understand what you're saying correctly, Prometheus will report the last computed value (based on the 1..2 interval rate) until T=10, so the reported rate for T=7 will be 100, not 200 as would be correct in this case, and as such a graph plotted at one-second resolution would have a straight line at "100" until the value 200 is reported at T=10.

Is this correct?

This is correct. At T=7 we don't have the T=10 value yet, so the best we can do is return 100/s.  We also don't want the values at a given point to change based on future points, as that'd make computations non-repeatable and hard to debug.




--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev







Peter Zaitsev

Jan 2, 2016, 1:55:44 PM
to Brian Brazil, Prometheus Developers
Brian,

Understood. The problem with rate() is that, unlike irate(), it is impossible to make it provide the information at the maximum resolution available.

I'm looking to build dashboards which are independent of the capture rate and completely zoomable.


