> Thoughts?
Is there any reason you can't use the Blackbox Exporter as it currently
is, just decreasing the scrape interval? Prometheus can scrape as
infrequently as every 2 minutes or as frequently as several times a second.
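For reference, a minimal sketch of what that looks like with the Blackbox Exporter's usual relabeling pattern (the module name, target, and exporter address here are illustrative, not from this thread):

```yaml
scrape_configs:
  - job_name: 'blackbox-icmp'
    scrape_interval: 15s          # scrape more often than the 1m default
    metrics_path: /probe
    params:
      module: [icmp]              # assumes an "icmp" module in blackbox.yml
    static_configs:
      - targets: ['192.0.2.10']   # host to ping (example address)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox:9115'   # where the exporter itself runs (assumed)
```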
The problem with what you're proposing is that you get an invalid picture of the data over time. This is the problem with the original smokeping program that the smokeping_prober is trying to solve.

The original smokeping software does exactly what you're talking about: it sends out a burst of 10 packets at the configured interval (in your example, 1 minute). The problem is that this does not give you a real picture, because the packets are not evenly spaced.

This is why I made the smokeping_prober work the way it does. It sends a regular stream, but captures the data in a smarter way, as a histogram.

From the histogram data you can collect the metrics only every minute and still generate the same "min / max / avg / dev / loss" values you're looking for. But the actual values are much more statistically valid, because it's measuring evenly over time.
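To make that concrete, these are the kinds of queries the histogram enables (assuming the prober's histogram is named `smokeping_response_duration_seconds`; adjust to the metric names in your setup):

```promql
# mean RTT over the last minute, from the histogram's sum and count
rate(smokeping_response_duration_seconds_sum[1m])
  / rate(smokeping_response_duration_seconds_count[1m])

# estimated 95th percentile RTT, from the bucket counts
histogram_quantile(0.95,
  rate(smokeping_response_duration_seconds_bucket[1m]))
```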
Thanks Ben.

On Thu, Feb 18, 2021 at 1:37 PM Ben Kochie <sup...@gmail.com> wrote:
> The problem with what you're proposing is that you get an invalid picture of the data over time. This is the problem with the original smokeping program that the smokeping_prober is trying to solve.
>
> The original smokeping software does exactly what you're talking about: it sends out a burst of 10 packets at the configured interval (in your example, 1 minute). The problem is that this does not give you a real picture, because the packets are not evenly spaced.
>
> This is why I made the smokeping_prober work the way it does. It sends a regular stream, but captures the data in a smarter way, as a histogram.
>
> From the histogram data you can collect the metrics only every minute and still generate the same "min / max / avg / dev / loss" values you're looking for. But the actual values are much more statistically valid, because it's measuring evenly over time.

That's fair. I do understand the argument for preferring continuous observations.

The problem I have with the histogram approach (and this is partly due to the way histograms currently work in Prometheus) is that I don't know the distribution a priori.

I let smokeping_prober run for a few days against several IP addresses. For one of them, after 250,000+ observations, it's telling me that the round-trip time is somewhere between 51.2 ms and 102.4 ms. Using the sum and the count from the histogram data I can derive an average (a mean) over a short window, and it gives me ~60 ms. I happen to know (from the individual observations) that the 95th percentile is also ~60 ms, which is pretty much the 50th percentile (the spread of the observations is very small). The actual min/max/avg from the observations is something like 59.1 / 59.7 / 59.4 ms. If I use the data from the histogram, the 50th percentile comes out as ~77 ms and the 95th percentile as ~100 ms.
I must be missing something, because I don't see how I would extract the min / max / dev from the available data. I do understand that the standard deviation for this data is unusually small (compared to what you'd expect to see in the wild), but still...
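As an aside, the ~77 ms / ~100 ms figures are consistent with how `histogram_quantile()` behaves: when essentially all observations land in a single bucket, Prometheus interpolates linearly across that bucket's width. A small sketch of that interpolation, using the 51.2–102.4 ms bucket from the example above (simplified to the single-bucket case):

```python
def quantile_from_bucket(q, lower, upper):
    """Linear interpolation within one histogram bucket, as Prometheus
    does when all observations fall inside that bucket's bounds."""
    return lower + q * (upper - lower)

# All RTTs fell into the 51.2 ms .. 102.4 ms bucket.
p50 = quantile_from_bucket(0.50, 51.2, 102.4)  # ~76.8 ms
p95 = quantile_from_bucket(0.95, 51.2, 102.4)  # ~99.8 ms
print(p50, p95)
```

This is also why min/max/dev cannot be recovered from the histogram: within a bucket, the individual observations are gone, and only the count between the bounds remains.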
I also have to think of the data size. For 1 ICMP packet every 1 second, I'm at (order of magnitude) 100 MB of data per target per month. Reducing this to 5 packets every 60 seconds I'm down to 10 MB (order of magnitude). This doesn't sound like much for a single target but it does add up.
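Back-of-the-envelope, the two rates differ by a factor of 12 in observation count, which matches the order-of-magnitude drop (the byte figures themselves are rough, since on-disk size depends on Prometheus compression and on how many histogram series each target produces):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # ~2.6 million seconds

obs_per_month_fast = 1 * SECONDS_PER_MONTH          # 1 packet per second
obs_per_month_slow = 5 * (SECONDS_PER_MONTH // 60)  # 5 packets per 60 seconds

print(obs_per_month_fast)   # 2,592,000 observations per target
print(obs_per_month_slow)   # 216,000 observations per target
print(obs_per_month_fast // obs_per_month_slow)   # 12x fewer
```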
As a side note, I noticed that smokeping_prober resolves the IP address once. With the Blackbox Exporter this happens every time the probe runs, so I don't have to do anything when monitoring a host whose IP address might change every now and then.
Thanks again,
Marcelo
> The problem I have with the histogram approach (and this is partly due to the way histograms currently work in Prometheus) is that I don't know the distribution a priori.
>
> I let smokeping_prober run for a few days against several IP addresses. For one of them, after 250,000+ observations, it's telling me that the round-trip time is somewhere between 51.2 ms and 102.4 ms. Using the sum and the count from the histogram data I can derive an average (a mean) over a short window, and it gives me ~60 ms. I happen to know (from the individual observations) that the 95th percentile is also ~60 ms, which is pretty much the 50th percentile (the spread of the observations is very small). The actual min/max/avg from the observations is something like 59.1 / 59.7 / 59.4 ms. If I use the data from the histogram, the 50th percentile comes out as ~77 ms and the 95th percentile as ~100 ms.
>
> I must be missing something, because I don't see how I would extract the min / max / dev from the available data. I do understand that the standard deviation for this data is unusually small (compared to what you'd expect to see in the wild), but still...

The default histogram buckets in the smokeping_prober cover latency durations from localhost to the moon and back. It's relatively easy to adjust the buckets, and easy enough to get within a reasonable range for your network's expectations.

Without knowing exactly what queries you're running, it's hard to say what you're doing. If you're using the histogram count/sum, this will give you the mean value.
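The idea behind adjusting the buckets is to pick a layout that is fine-grained around the expected RTT instead of spanning microseconds to seconds. A sketch of generating such a layout, mirroring the `ExponentialBuckets` helper from the Prometheus Go client library (the start/factor/count values here are illustrative, not the prober's defaults; check the prober's `--help` for how to pass custom buckets):

```python
def exponential_buckets(start, factor, count):
    """Upper bounds of `count` exponentially growing buckets,
    like prometheus/client_golang's ExponentialBuckets helper."""
    bounds = []
    value = start
    for _ in range(count):
        bounds.append(value)
        value *= factor
    return bounds

# Buckets focused on a ~60 ms RTT: 40 ms up to ~94 ms in 1.1x steps,
# instead of the wide default 2x steps.
buckets = exponential_buckets(0.040, 1.1, 10)
print([round(b * 1000, 1) for b in buckets])
```

With 1.1x steps around the expected value, the 59–60 ms observations above would spread across narrow buckets, and `histogram_quantile()` would land much closer to the true percentiles.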
There is one known issue with the smokeping_prober right now that I need to fix: the ping library's handling of sequence numbers is broken and doesn't wrap correctly.

> I also have to think of the data size. For 1 ICMP packet every 1 second, I'm at (order of magnitude) 100 MB of data per target per month. Reducing this to 5 packets every 60 seconds, I'm down to 10 MB (order of magnitude). This doesn't sound like much for a single target, but it does add up.

Yes, this is going to be an issue no matter what you do. I don't see how it relates to either mode of operation.