Prometheus and Bad Sample Management


Peter Zaitsev

Dec 30, 2015, 9:10:26 PM
to Prometheus Developers
Hi,

I have been testing Prometheus with two nodes with a wireless connection between them, and as the network went bad for some time, I found bad samples came in showing disk I/O bandwidth spiking to multiple GB/sec, along with a whole bunch of other values outside of what you would expect.

How does Prometheus deal with very long sampling?

For example, what if a request to a remote node_exporter took 60 seconds to complete? Let's assume the stall was on the network side, which means the actual snapshot can correspond to any time within those 60 seconds. How does Prometheus know where?

Note that even if you put the timestamp in the metrics exporter, it does not remove the problem, as there are cases when MySQL can stall processing SHOW STATUS for a large number of seconds.

One way we dealt with this issue in another project was to validate the samples: if something is selected to be scraped every 5 seconds, the sample is only accepted if it took 1 second or less, as we know in that case it will not be more than 20% off.

Is there a similar feature in Prometheus I can enable?

Generally, I think missed samples because of network issues, overload, etc. are much better than misleading information.


Brian Brazil

Dec 31, 2015, 4:59:34 AM
to Peter Zaitsev, Prometheus Developers
On Thu, Dec 31, 2015 at 2:10 AM, Peter Zaitsev <p...@percona.com> wrote:
Hi,

I have been testing Prometheus with two nodes with a wireless connection between them, and as the network went bad for some time, I found bad samples came in showing disk I/O bandwidth spiking to multiple GB/sec, along with a whole bunch of other values outside of what you would expect.

How does Prometheus deal with very long sampling?

It doesn't treat it any differently. It sounds like you might have run into rate() not dealing well with appearing and disappearing timeseries, which we're in the process of improving.

We generally recommend having Prometheus on the same network as what you're monitoring, as it helps avoid this sort of problem.
 

For example, what if a request to a remote node_exporter took 60 seconds to complete?

The default timeout is 10s, so by default it'd be a failed scrape.
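For reference, a minimal sketch of how those two settings fit together in prometheus.yml (modern syntax; the job name and target below are placeholders, not from this thread):

 scrape_configs:
   - job_name: 'node'                       # hypothetical job name
     scrape_interval: 5s                    # how often to scrape
     scrape_timeout: 5s                     # scrapes taking longer count as failed; defaults to 10s
     static_configs:
       - targets: ['db1.example.com:9100']  # placeholder node_exporter target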
 
Let's assume the stall was on the network side, which means the actual snapshot can correspond to any time within those 60 seconds. How does Prometheus know where?

It doesn't know, but it'll always choose the same time (start of the scrape iirc). This is on the presumption that the delay is constant, which it usually is going to be. 

Note that even if you put the timestamp in the metrics exporter, it does not remove the problem, as there are cases when MySQL can stall processing SHOW STATUS for a large number of seconds.

One way we dealt with this issue in another project was to validate the samples: if something is selected to be scraped every 5 seconds, the sample is only accepted if it took 1 second or less, as we know in that case it will not be more than 20% off.

Is there a similar feature in Prometheus I can enable?

Generally, I think missed samples because of network issues, overload, etc. are much better than misleading information.

It sounds like you're looking for timeouts.

Brian 






Björn Rabenstein

Jan 1, 2016, 5:17:53 PM
to Brian Brazil, Peter Zaitsev, Prometheus Developers
Independently of what's discussed here, I think we should clearly
document which timestamp Prometheus will attach for a scrape
(beginning or end). The case where a monitored target knows exactly
the timestamp of the value (and cares about it) might be a good
use-case for an explicitly provided timestamp. We wouldn't run into
staleness problems as long as the timestamp is close to the time of
the scrape.

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

Peter Zaitsev

Jan 1, 2016, 5:31:27 PM
to Brian Brazil, Prometheus Developers
Brian,

The network is just one example. Looking specifically at MySQL, at times of high load it can take a long time to respond, producing bizarre spikes in the data. For pretty much everything, really, we're not dealing with hard realtime systems, so a target can respond with significant delay with some probability. That's not a big deal if you're just interested in averages, but if you want to "zoom in" at 1-second resolution, it becomes hard to see what is noise and what is data.

I do not really mind whether it is implemented as a timeout or as an additional feature to throw away samples that took too long.

You mentioned a timeout is available. Can I set it to 100ms for a 1s capture, so invalid captures can be ignored?

--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev



Peter Zaitsev

Jan 1, 2016, 5:35:36 PM
to Brian Brazil, Prometheus Developers
Hm,

It looks like you're speaking about scrape_timeout:

 scrape_timeout:    100ms


It looks like specifying it as either 100ms or 0.1s does not work.

Is there a good reason to have a 1-second minimum resolution for the timeout?


Brian Brazil

Jan 1, 2016, 5:47:25 PM
to Peter Zaitsev, Prometheus Developers
On Fri, Jan 1, 2016 at 10:31 PM, Peter Zaitsev <p...@percona.com> wrote:
Brian,

The network is just one example. Looking specifically at MySQL, at times of high load it can take a long time to respond, producing bizarre spikes in the data. For pretty much everything, really, we're not dealing with hard realtime systems, so a target can respond with significant delay with some probability. That's not a big deal if you're just interested in averages, but if you want to "zoom in" at 1-second resolution, it becomes hard to see what is noise and what is data.

There is no general solution to this problem; if the system you're trying to monitor is overloaded, then all bets are off. For high-granularity scrapes (anything under 10s) I'd recommend keeping them to the bare minimum of metrics and ensuring the code paths involved are very fast.

Brian




Peter Zaitsev

Jan 1, 2016, 6:03:50 PM
to Brian Brazil, Prometheus Developers
I understand that,

I think we're speaking about different things here. If some information _usually_ takes 10 seconds to sample, of course there is no point in sampling it at a 1-second rate.

However, if you look at MySQL and a simple SHOW STATUS sample, it is very feasible at 1-second resolution and is often very helpful for problem diagnostics.

This, however, is exactly when it is important to know what is real data and what is the noise of an invalid capture. For example, I might well correlate application timeouts to a query spike where 50,000 QPS is reported instead of the normal 5,000 QPS. This may well be reality, due to some stalls at the application level for example, or it may be some other problem that caused SHOW STATUS to be slow to return, giving us wrong data and sending us on a wild goose chase.

If there is no solution right now, I just wanted to make sure the problem is understood, and why it is important to me.



Brian Brazil

Jan 1, 2016, 6:31:15 PM
to Peter Zaitsev, Prometheus Developers
On Fri, Jan 1, 2016 at 11:03 PM, Peter Zaitsev <p...@percona.com> wrote:
I understand that,

I think we're speaking about different things here. If some information _usually_ takes 10 seconds to sample, of course there is no point in sampling it at a 1-second rate.

However, if you look at MySQL and a simple SHOW STATUS sample, it is very feasible at 1-second resolution and is often very helpful for problem diagnostics.

This, however, is exactly when it is important to know what is real data and what is the noise of an invalid capture. For example, I might well correlate application timeouts to a query spike where 50,000 QPS is reported instead of the normal 5,000 QPS. This may well be reality, due to some stalls at the application level for example, or it may be some other problem that caused SHOW STATUS to be slow to return, giving us wrong data and sending us on a wild goose chase.

If there is no solution right now, I just wanted to make sure the problem is understood, and why it is important to me.

I understand where you're coming from. I don't think there's any good solution at the Prometheus or exporter level. The true solution is to make the application not overloaded, which obviously has its own challenges. This is the reason we have scrape_duration_seconds: to help detect overloaded systems when the system itself can't be trusted.

Throwing away data will cause artifacts with irate. Messing around with timestamps only moves the problem around, exposing you to slightly different races. Both of these are likely to cause more problems than they solve.

At a 1s scrape there should be a 1s timeout. At worst you should see a spike 2x the true value. Using rate() rather than irate() will smooth such things out a bit too.
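As a sketch of the difference in PromQL, using the node_exporter counter that appears later in this thread:

 rate(node_disk_sectors_read[5m])    # average per-second increase over the whole 5m window; smooths spikes out
 irate(node_disk_sectors_read[5m])   # per-second increase between the last two samples in the window; keeps spikes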

Brian 




Peter Zaitsev

Jan 1, 2016, 7:17:33 PM
to Brian Brazil, Prometheus Developers

I understand where you're coming from. I don't think there's any good solution at the Prometheus or exporter level. The true solution is to make the application not overloaded, which obviously has its own challenges. This is the reason we have scrape_duration_seconds: to help detect overloaded systems when the system itself can't be trusted.

Throwing away data will cause artifacts with irate. Messing around with timestamps only moves the problem around, exposing you to slightly different races. Both of these are likely to cause more problems than they solve.

At a 1s scrape there should be a 1s timeout. At worst you should see a spike 2x the true value. Using rate() rather than irate() will smooth such things out a bit too.


You're right that a 1s scrape with a 1s timeout will not cause things to be off by more than double. I would like the data to be more trustworthy, though, having it be no more than 20% off from reality. Too bad that currently can't be configured.

I do not think messing with the timestamp is just moving the problem around. Reporting 50,000 queries in the given time obviously misleads any further processing of the data that may be going on. If, however, you simply do not have any data for a given period of time, that is fine, as such systems are typically not designed to expect absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e. if I have a query counter at 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, then irate() will see a change of 50K over a 10-second period and report a rate of 5,000/sec, exactly what I'm looking for. Of course I will not have any detail of in which of those 10 seconds the queries came, but that is OK.
 

Brian Brazil

Jan 1, 2016, 7:51:41 PM
to Peter Zaitsev, Prometheus Developers
On Sat, Jan 2, 2016 at 12:17 AM, Peter Zaitsev <p...@percona.com> wrote:

I understand where you're coming from. I don't think there's any good solution at the Prometheus or exporter level. The true solution is to make the application not overloaded, which obviously has its own challenges. This is the reason we have scrape_duration_seconds: to help detect overloaded systems when the system itself can't be trusted.

Throwing away data will cause artifacts with irate. Messing around with timestamps only moves the problem around, exposing you to slightly different races. Both of these are likely to cause more problems than they solve.

At a 1s scrape there should be a 1s timeout. At worst you should see a spike 2x the true value. Using rate() rather than irate() will smooth such things out a bit too.


You're right that a 1s scrape with a 1s timeout will not cause things to be off by more than double. I would like the data to be more trustworthy, though, having it be no more than 20% off from reality. Too bad that currently can't be configured.

I do not think messing with the timestamp is just moving the problem around.

Consider what happens if a long pause occurs in the middle of SHOW STATUS. No matter what you do, at least half the values will have the wrong timestamp. This is an unavoidable race condition. There's also the problem of distributed time synchronisation.

Reporting 50,000 queries in the given time obviously misleads any further processing of the data that may be going on. If, however, you simply do not have any data for a given period of time, that is fine, as such systems are typically not designed to expect absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e. if I have a query counter at 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, then irate() will see a change of 50K over a 10-second period and report a rate of 5,000/sec, exactly what I'm looking for. Of course I will not have any detail of in which of those 10 seconds the queries came, but that is OK.

It'll continue to report 5,000/sec until it gets another value, and in the gap it will report the old value. It reports the rate looking back in time, not the rate at a given time.

Brian
 
 


--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev







Peter Zaitsev

Jan 1, 2016, 8:11:59 PM
to Brian Brazil, Prometheus Developers
I do not think messing with the timestamp is just moving the problem around.

Consider what happens if a long pause occurs in the middle of SHOW STATUS. No matter what you do, at least half the values will have the wrong timestamp. This is an unavoidable race condition. There's also the problem of distributed time synchronisation.

I'm confused: where do you see the time synchronization problem here? As I understand it, exporters do not even report the time. They report certain variable values, and Prometheus assigns the timestamp based on when it got the data. What do you see as being synchronized here?

What should happen if SHOW STATUS stalls in the middle? In that case, in my opinion, we do not get a complete sample and it should be discarded.



 
 
Reporting 50,000 queries in the given time obviously misleads any further processing of the data that may be going on. If, however, you simply do not have any data for a given period of time, that is fine, as such systems are typically not designed to expect absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e. if I have a query counter at 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, then irate() will see a change of 50K over a 10-second period and report a rate of 5,000/sec, exactly what I'm looking for. Of course I will not have any detail of in which of those 10 seconds the queries came, but that is OK.

It'll continue to report 5,000/sec until it gets another value, and in the gap it will report the old value. It reports the rate looking back in time, not the rate at a given time.


Are you sure about that? In my experiments, if I make some samples unavailable (i.e. by pausing the exporter for a while), irate() seems to deal with it quite well.

Here is an example graph from Grafana. I stopped the node exporter on this node so some samples were not available, yet it looks like the average bandwidth for this time was computed properly:

[Inline image: Grafana graph of disk read bandwidth across the exporter outage]

This is the query which was used:

irate(node_disk_sectors_read{alias="$host",device!~"dm-*"}[5m])*512

Brian Brazil

Jan 2, 2016, 12:10:31 PM
to Peter Zaitsev, Prometheus Developers
On Sat, Jan 2, 2016 at 1:11 AM, Peter Zaitsev <p...@percona.com> wrote:
I do not think messing with the timestamp is just moving the problem around.

Consider what happens if a long pause occurs in the middle of SHOW STATUS. No matter what you do, at least half the values will have the wrong timestamp. This is an unavoidable race condition. There's also the problem of distributed time synchronisation.

I'm confused: where do you see the time synchronization problem here? As I understand it, exporters do not even report the time. They report certain variable values, and Prometheus assigns the timestamp based on when it got the data. What do you see as being synchronized here?

That's correct. You had mentioned having exporters send the time, so I was pointing out the challenges with that.

What should happen if SHOW STATUS stalls in the middle? In that case, in my opinion, we do not get a complete sample and it should be discarded.



 
 
Reporting 50,000 queries in the given time obviously misleads any further processing of the data that may be going on. If, however, you simply do not have any data for a given period of time, that is fine, as such systems are typically not designed to expect absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e. if I have a query counter at 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, then irate() will see a change of 50K over a 10-second period and report a rate of 5,000/sec, exactly what I'm looking for. Of course I will not have any detail of in which of those 10 seconds the queries came, but that is OK.

It'll continue to report 5,000/sec until it gets another value, and in the gap it will report the old value. It reports the rate looking back in time, not the rate at a given time.

Are you sure about that? In my experiments, if I make some samples unavailable (i.e. by pausing the exporter for a while), irate() seems to deal with it quite well.

Here is an example graph from Grafana. I stopped the node exporter on this node so some samples were not available, yet it looks like the average bandwidth for this time was computed properly:

[Inline image: Grafana graph of disk read bandwidth across the exporter outage]

This is the query which was used:

irate(node_disk_sectors_read{alias="$host",device!~"dm-*"}[5m])*512

It'll look okay, but it doesn't mean what you think it means. That long flat period's true value is actually a bit lower, as the next point after it is lower.


Peter Zaitsev

Jan 2, 2016, 12:28:01 PM
to Brian Brazil, Prometheus Developers
Brian,

Can you please be more specific, in terms of math or code, about what you mean when you say it does not mean what I think it means?

I have worked with enough monitoring tools in the industry to have an understanding of how the rate is computed for sparse time series. It is done the following way:

If we have 2 samples with values N1 and N2 corresponding to times T1 and T2:

T1   N1
T2   N2

Instant Rate = (N2 - N1) / (T2 - T1)

Does Prometheus compute the rate in some other way? How?

If this is how it does it, then when losing samples, i.e. if instead of T2 = T1 + 1 sec we have data available only at T3 = T2 + 5 min, we would still get the correct average rate of increase in the N value; it will just be an average over 5 minutes rather than 1 second.



The lower data points in this graph have nothing to do with the period when the exporter was down; it is simply a volatile time series which changes a lot, and as you can see it had dips before and after.

Brian Brazil

Jan 2, 2016, 12:35:25 PM
to Peter Zaitsev, Prometheus Developers
On Sat, Jan 2, 2016 at 5:28 PM, Peter Zaitsev <p...@percona.com> wrote:
Brian,

Can you please be more specific, in terms of math or code, about what you mean when you say it does not mean what I think it means?

I have worked with enough monitoring tools in the industry to have an understanding of how the rate is computed for sparse time series. It is done the following way:

If we have 2 samples with values N1 and N2 corresponding to times T1 and T2:

T1   N1
T2   N2

Instant Rate = (N2 - N1) / (T2 - T1)

Does Prometheus compute the rate in some other way? How?

That is how it's calculated, but that's not the full computational model. Between T1 and T2, the T0->T1 rate is returned. Between T2 and T3, the T1->T2 rate is returned. Computations in Prometheus are always done looking backwards, so the values will be correct but not necessarily with the timestamp you expect.

There's no workaround for irate(), but for rate() what you can do is request a timestamp half the period of the rate forwards in time to correct for this. For example, for a 5m rate you'd ask for a time 2.5m ahead.
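Spelling out why half the period is the right shift (a plain restatement of the above, not a Prometheus feature): a rate over the window [t - w, t] computes

 (f(t) - f(t - w)) / w

which is the best estimate of the derivative at the window midpoint t - w/2, not at t. So the value returned for time t is really centered on t - w/2; asking for t + w/2 and plotting the result at t corrects for that.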

Brian




Peter Zaitsev

Jan 2, 2016, 12:37:47 PM
to Brian Brazil, Prometheus Developers
Brian,

Thinking a little bit more about that....

What you seem to be saying is that Prometheus does not map data to intervals the way I would think, not that irate() computes it differently?

Let's look at the following times and values:

T      V

1    - 100
2    - 200
10   - 1800

as the available samples, with the 8 seconds in between skipped.

If we look at the mathematically correct answer to the question of what the rate was at T=7, we can see it belongs to the interval 2..10, where the average rate was 200/sec. However, if I understand what you're saying correctly, Prometheus will report the last computed value (based on the 1..2 interval rate) until T=10, so the reported rate for T=7 will be 100, not 200 as would be correct in this case, and as such a graph plotted at one-second resolution would have a straight line at "100" until the value 200 is reported at T=10.

Is this correct?




Brian Brazil

Jan 2, 2016, 12:43:30 PM
to Peter Zaitsev, Prometheus Developers
This is correct. At T=7 we don't have the T=10 value yet, so the best we can do is return 100/s.  We also don't want the values at a given point to change based on future points, as that'd make computations non-repeatable and hard to debug.
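Tracing the evaluation with those samples (assuming a range long enough to cover both points) makes this concrete:

 samples:                     t=1 → 100    t=2 → 200    t=10 → 1800

 irate at any t in [2, 10):   (200 - 100) / (2 - 1)   = 100/s   (last two samples available so far)
 irate at t >= 10:            (1800 - 200) / (10 - 2) = 200/s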


Peter Zaitsev

Jan 2, 2016, 12:48:00 PM
to Brian Brazil, Prometheus Developers
Brian,

Thanks, now it makes sense. This is a bummer... it makes irate() pretty useless for sparse series.

This also makes dashboards very hard to build when the sample rate is unknown.

One semi-solution here is to pass a very short interval to irate() so it can at least report no data available instead of showing the last known increase when samples are missing, as sketched below.
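A sketch of that semi-solution, assuming a 1s scrape interval and a mysqld_exporter-style counter (the metric name here is illustrative):

 irate(mysql_global_status_queries[3s])

With the range barely over two scrape intervals, the window normally contains the two most recent samples; once the window no longer contains two points the expression returns no data instead of repeating the last known rate.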

I wish we had some function like firate() or something similar which reports the rate of the interval to which the data point belongs, not the previous interval as it is now.



Peter Zaitsev

Jan 2, 2016, 12:48:04 PM
to Brian Brazil, Prometheus Developers
Brian,

For irate() there is actually a very simple solution to this problem: as it only needs to look one data point ahead, it is not a big deal to say that the value for T is not available until a data point with T1 > T exists. It delays what you can plot on graphs, generally by one sample interval, but to my taste makes them much better by not reporting any stale data.

If one wanted to implement such a function in Prometheus in addition to irate(), how would one go about it?

Brian Brazil

Jan 2, 2016, 12:59:39 PM
to Peter Zaitsev, Prometheus Developers
On Sat, Jan 2, 2016 at 5:48 PM, Peter Zaitsev <p...@percona.com> wrote:
Brian,

For irate() there is actually a very simple solution to this problem: as it only needs to look one data point ahead, it is not a big deal to say that the value for T is not available until a data point with T1 > T exists. It delays what you can plot on graphs, generally by one sample interval, but to my taste makes them much better by not reporting any stale data.

If one wanted to implement such a function in Prometheus in addition to irate(), how would one go about it?

Such a function doesn't really fit the computation model, as it goes against how we think about range vectors and breaks the invariant that values don't change as newer values come in.

There's nothing stopping you from doing this at the graphing layer with rate(), though, by increasing the timestamps you're requesting by half the period.
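A sketch of what that looks like against the HTTP API, under the assumption that the dashboard does the shifting itself: to draw a point at wall-clock time t using a 5m rate, evaluate

 /api/v1/query?query=rate(node_disk_sectors_read[5m])&time=<t + 150s>

and plot the returned value at t, so each rate is attributed to the midpoint of its window.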

Brian
 

On Sat, Jan 2, 2016 at 12:43 PM, Brian Brazil <brian....@robustperception.io> wrote:
On Sat, Jan 2, 2016 at 5:37 PM, Peter Zaitsev <p...@percona.com> wrote:
Brian,

Thinking a little bit more about that....

What you seem to be saying is that Prometheus does not map data to intervals the way I would think, not that irate() computes it differently?

Let's look at the following times and values:

T      V

1    - 100
2    - 200
10   - 1800

as the available samples, with the 8 seconds in between skipped.

If we look at the mathematically correct answer to the question of what the rate was at T=7, we can see it belongs to the interval 2..10, where the average rate was 200/sec. However, if I understand what you're saying correctly, Prometheus will report the last computed value (based on the 1..2 interval rate) until T=10, so the reported rate for T=7 will be 100, not 200 as would be correct in this case, and as such a graph plotted at one-second resolution would have a straight line at "100" until the value 200 is reported at T=10.

Is this correct?

This is correct. At T=7 we don't have the T=10 value yet, so the best we can do is return 100/s.  We also don't want the values at a given point to change based on future points, as that'd make computations non-repeatable and hard to debug.




--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev







Peter Zaitsev

Jan 2, 2016, 1:55:44 PM
to Brian Brazil, Prometheus Developers
Brian,

Understood. The problem with rate() is that, unlike irate(), it is impossible to make it provide the information at the maximum resolution available.

I'm looking to build dashboards which are independent of the capture rate and completely zoomable.


