Hi,

I have been testing Prometheus with two nodes connected over a wireless link, and when the network went bad for some time I found bad samples coming in, showing disk IO bandwidth spiking to multiple GB/sec and a whole bunch of other values outside of what you would expect. How does Prometheus deal with very long sampling?
For example, what if a request to a remote node_exporter took 60 seconds to complete?
Let's assume the stall was on the network side, which means the actual snapshot can correspond to any time within those 60 seconds. How does Prometheus know where?
Note that even if you put the timestamp in the metrics exporter, it does not remove the problem, as there are cases where MySQL can stall processing SHOW STATUS for a large number of seconds.

One way we dealt with this issue in another project was to validate the samples: if something is set to be scraped every 5 sec, the sample is only accepted if it took 1 sec or less, as we know that in this case it would not be more than 20% off. Is there some similar feature in Prometheus I can enable?

Generally I think missed samples, because of network issues, overload etc., are much better than misleading information.
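There is no per-sample validation setting like this that I know of, but Prometheus does record how long every scrape took as scrape_duration_seconds, and that can be used at query time as a rough approximation of the 5 sec / 1 sec rule. A sketch only; the job name, metric name and thresholds are assumptions for illustration:

  # Show the query rate only for instances whose most recent scrape finished
  # in under 1 second (20% of a 5s scrape interval). This checks the latest
  # scrape rather than every sample inside the rate window, so it is an
  # approximation of the validation rule, not an equivalent of it.
  rate(mysql_global_status_queries[30s])
    and on (instance) scrape_duration_seconds{job="mysql"} < 1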
Brian,

Network is just one example. Looking specifically at MySQL, at times of high load it can take a long time to respond, producing bizarre spikes in the data. Really, for pretty much everything: we're not dealing with hard realtime systems, so with some probability they will respond with significant delay. Not a big deal if you're just interested in averages, but if you want to "zoom in" to 1 second resolution it becomes hard to see what is noise and what is data.
I understand that. I think we're speaking about different things here. If some information _usually_ takes 10 seconds to sample, of course there is no point in sampling it at a 1 sec rate.

However, if you look at MySQL and a simple SHOW STATUS sample, it is very feasible at 1 sec resolution and is often very helpful for problem diagnostics. This is exactly when it is important to know what is real data and what is the noise of an invalid capture. For example, I might well correlate application timeouts to a query spike where 50,000 QPS is reported instead of the normal 5,000 QPS. This may well be reality, due to some stalls at the application level for example, or it may be that some other problem caused SHOW STATUS to be slow to return, giving us wrong data and setting us on a wild goose chase.

If there is no solution as of right now, I just wanted to make sure it is understood what the problem is and why it is important to me.
I understand where you're coming from. I don't think there's any good solution at the Prometheus or exporter level. The true solution is to make the application not overloaded, which obviously has its own challenges. This is the reason we have scrape_duration_seconds: to help detect overloaded systems when the system itself can't be trusted.

Throwing away data will cause artifacts with irate. Messing around with timestamps only moves the problem around, exposing you to slightly different races. Both of these are likely to cause more problems than they solve.

At a 1s scrape there should be a 1s timeout. At worst you should see a spike 2x the true value. Using rate() rather than irate() will smooth such things out a bit too.
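A minimal sketch of the rate()-vs-irate() difference described above; the mysqld_exporter metric name, the instance label and the 30s window are assumptions, not something taken from this thread:

  # With a 1s scrape interval, irate() only ever uses the two most recent
  # samples in the window, so a single delayed scrape shows up at full height
  # (up to 2x the true value, as noted above):
  irate(mysql_global_status_queries{instance="db1:9104"}[30s])

  # rate() averages the increase over the whole 30s window, so the same
  # delayed scrape barely moves the result:
  rate(mysql_global_status_queries{instance="db1:9104"}[30s])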
> At a 1s scrape there should be a 1s timeout. At worst you should see a spike 2x the true value.

You're right, a 1s scrape with a 1s timeout will not cause things to be off by more than double. I would like the data to be more trustworthy, having it be no more than 20% off from reality. Too bad this currently can't be configured.

I do not think messing with the timestamp is just moving the problem around.
One case is stating that I had 50,000 queries in the given time, which obviously misleads any kind of further processing of the data that may be going on.
If, however, you simply do not have any data for the given period of time, it is fine, as such systems are typically not designed expecting absolutely 100% of samples to be present.

If I understand how irate() works correctly, it is supposed to simply skip such samples. I.e., if I have a query counter of 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, irate will see the change of 50K over a 10 second period and report a rate of 5,000/sec - exactly what I'm looking for. Of course I will not have any details of which of these 10 seconds those queries came in, but that is OK.
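A concrete sketch of that arithmetic; the metric name is an assumption, and the range window just needs to be wide enough to bridge the gap:

  # Samples from the example: counter = 100000 at t=0s, the scrapes in between
  # are lost, counter = 150000 at t=10s. As long as the range window spans both
  # samples, irate() takes the last two points it finds and reports
  # (150000 - 100000) / (10s - 0s) = 5000 queries/sec.
  irate(mysql_global_status_queries[1m])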
> I do not think messing with the timestamp is just moving the problem around.

Consider what happens if a long pause happens in the middle of the SHOW STATUS. No matter what you do, at least half the values will have the wrong timestamp. This is an unavoidable race condition. There's also the problem of distributed time synchronisation.
> If I have a query counter of 100K at T=0, a few samples had to be thrown out, and 10 seconds later I get a counter of 150K, irate will see the change of 50K over a 10 second period and report a rate of 5,000/sec - exactly what I'm looking for.

It'll continue to report that 5,000/s until it gets another value, and in the gap it will report the old value. It reports the rate looking back in time, not the rate at a given time.

> Consider what happens if a long pause happens in the middle of the SHOW STATUS. No matter what you do, at least half the values will have the wrong timestamp. This is an unavoidable race condition. There's also the problem of distributed time synchronisation.

I'm confused: where do you see the time synchronization problem here? As I understand it, exporters do not even report the time. They report certain variable values, to which Prometheus assigns a timestamp based on when it got the data. What do you see as being synchronized here?
What should happen if SHOW STATUS stalls in the middle? In this case, in my opinion, we do not get a complete sample and it should be discarded.

> It'll continue to report that 5,000/s until it gets another value, and in the gap it will report the old value. It reports the rate looking back in time, not the rate at a given time.

Are you sure about that? In my experiments, if I make some samples unavailable (i.e. by pausing the exporter for a while), irate() seems to deal with it quite OK. Here is an example of the graph from Grafana - I stopped the node_exporter on this node so some samples were not available, yet it looks like the average bandwidth for this time was computed properly. This is the query which was used:

irate(node_disk_sectors_read{alias="$host",device!~"dm-*"}[5m])*512
Brian,

Can you please be more specific, in terms of math or code, when you say it does not mean what I think it means?

I have worked with enough monitoring tools in the industry to have an understanding of how the rate is going to be computed for a sparse time series. This is done the following way: if we have 2 samples with values N1 and N2 corresponding to times T1 and T2,

T1  N1
T2  N2

Instant Rate = (N2 - N1) / (T2 - T1)

Does Prometheus do it some other way? How?
Brian,
For irate() there is actually a very simple solution to this problem: as it only needs to look one data point ahead, it is not a big deal to say the value for time T is not available until a data point with T1 > T exists. It generally delays what you can plot on graphs by one sample interval, but makes them much better, for my taste, by not reporting any stale data.

If one would like to implement such a function in Prometheus, in addition to irate, what would one do?
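Short of implementing a new function, a query-level sketch that gets part of the way there (the metric name, instance label, 1s scrape interval and 3s window are all assumptions): shrink the range window so irate() only has something to return when two recent samples really exist. This does not give the "wait for the next point" semantics, but it does stop stale rates from being drawn during gaps.

  # With a 1s scrape interval, a 3s window normally contains the last two or
  # three samples. During a gap (exporter paused, scrape timed out) there are
  # no samples in the window at all, so the series simply disappears from the
  # graph rather than carrying the last computed rate forward. The tight window
  # leaves little slack for scrape jitter, so expect occasional holes even on a
  # healthy target.
  irate(mysql_global_status_queries{instance="db1:9104"}[3s])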
On Sat, Jan 2, 2016 at 12:43 PM, Brian Brazil <brian....@robustperception.io> wrote:

> On Sat, Jan 2, 2016 at 5:37 PM, Peter Zaitsev <p...@percona.com> wrote:
>
>> Brian,
>>
>> Thinking a little bit more about that... What you seem to be saying is that Prometheus does not map data to the interval in the way I would think, rather than this being about how irate() computes it?
>>
>> Let's look at the following times and values as the available samples, with 8 seconds skipped:
>>
>>   T    V
>>   1    100
>>   2    200
>>   10   1800
>>
>> If we look at the mathematically correct answer to the question "what was the rate at T=7", we can see it belongs to the interval 2..10, where the average rate was 200/sec. However, if I understand what you're saying correctly, Prometheus will report the last computed value (based on the 1..2 interval rate) until T=10, so the reported rate for T=7 will be 100, not 200 as would be correct in this case, and as such a graph plotted with 1 second resolution would have a straight line of "100" going until the value 200 is reported at T=10? Is this correct?
>
> This is correct. At T=7 we don't have the T=10 value yet, so the best we can do is return 100/s. We also don't want the values at a given point to change based on future points, as that'd make computations non-repeatable and hard to debug.
>
> --
> Brian Brazil