Difference in scope between smokeping_prober and blackbox_exporter?

Marcelo Magallón

Feb 17, 2021, 6:02:10 PM
to Prometheus Developers, sup...@gmail.com
Hi,

I'm trying to understand the difference in scope between blackbox_exporter and smokeping_prober.

I was thinking of extending blackbox_exporter with functionality similar to smokeping_prober.

In the context of BBE, the existing ping prober sends exactly one ICMP packet per probe, and the probing interval is controlled by the Prometheus scrape interval.

smokeping_prober, by contrast, sends ICMP packets at regular intervals and builds a histogram that Prometheus collects at the scrape interval.

For BBE, what I was thinking is letting the user specify a number of ICMP packets to be sent per probe (and an interval between them) so that it can report min / max / avg / dev / loss metrics. The number and the interval would have to be tightly restricted to avoid very long scrape times.
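
To make that concrete, a module definition could look something like this. The `packets` and `interval` fields are hypothetical, purely to illustrate the proposal; the rest is existing blackbox_exporter config:

```yaml
modules:
  icmp_burst:
    prober: icmp
    timeout: 10s
    icmp:
      preferred_ip_protocol: "ip4"
      # Hypothetical fields illustrating the proposal, not part of
      # blackbox_exporter today:
      packets: 5      # echo requests per probe
      interval: 1s    # spacing between them
```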

The reason for this is that I don't need the continuous pinging functionality provided by smokeping_prober (a very short interval over a long period of time), but I also can't get by with the current BBE functionality (a relatively long interval over a long period of time). What I'm looking for is a small number of repetitions spaced at a comparatively long interval, so that I can derive a more representative packet loss metric (1 packet lost out of 5 over a 10 minute interval is not the same as 1 packet lost out of 5 over 5 seconds).

Thoughts?

Thanks!

--
Marcelo Magallón

Stuart Clark

Feb 17, 2021, 6:15:29 PM
to Marcelo Magallón, Prometheus Developers, sup...@gmail.com
Is there any reason you can't use the Blackbox Exporter as it currently
is, just decreasing the scrape interval? Prometheus can scrape as
infrequently as every 2 minutes or as frequently as several times a second.

--
Stuart Clark

Marcelo Magallón

Feb 18, 2021, 9:17:15 AM
to Stuart Clark, Prometheus Developers, sup...@gmail.com
On Wed, Feb 17, 2021 at 5:15 PM Stuart Clark <stuart...@jahingo.com> wrote:
>> Thoughts?
>
> Is there any reason you can't use the Blackbox Exporter as it currently
> is, just decreasing the scrape interval? Prometheus can scrape as
> infrequently as every 2 minutes or as frequently as several times a second.

Thanks Stuart,

Decreasing the scrape interval with blackbox_exporter would cause it to behave more like smokeping_prober, which is not exactly the thing I'm after.

Also, smokeping_prober creates histograms from the observations, so even if you set up an ICMP check with blackbox_exporter using a very short interval, you wouldn't get the same information from it.

To provide a concrete example in terms of timestamps: with smokeping_prober you send packets at 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, ... (1 packet every second, with the scrape interval being, say, 60 seconds).

With current blackbox_exporter you could do something similar by setting the scrape interval to 1 second, but you'd get slightly different information.

With what I'm proposing you'd send packets at, say, 0, 1, 2, 3, 4, 60, 61, 62, 63, 64, 120, 121, 122, 123, 124, ... (5 packets within 5 seconds, then wait 55 seconds, send another 5 packets in 5 seconds, and so on; the scrape interval would be 60 seconds).
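
In code, the two send schedules would look something like this (a plain Python sketch, nothing more):

```python
def continuous(duration_s, interval_s=1):
    """smokeping_prober style: one packet every interval_s seconds."""
    return list(range(0, duration_s, interval_s))

def burst(duration_s, packets=5, spacing_s=1, scrape_interval_s=60):
    """Proposed BBE style: a burst of `packets` at each scrape interval."""
    times = []
    for start in range(0, duration_s, scrape_interval_s):
        times.extend(start + i * spacing_s for i in range(packets))
    return times

print(continuous(10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(burst(130))      # [0, 1, 2, 3, 4, 60, 61, 62, 63, 64, 120, 121, 122, 123, 124]
```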

--
Marcelo Magallón

Ben Kochie

Feb 18, 2021, 2:37:18 PM
to Marcelo Magallón, Stuart Clark, Prometheus Developers
The problem with what you're proposing is you're getting an invalid picture of data over time. This is the problem with the original smokeping program that the smokeping prober is trying to solve.

The original smokeping software does exactly what you're talking about. It sends out a burst of 10 packets at the configured interval (in your example, 1 minute). The problem is this does not give you a real picture, because the packets are not evenly spaced.

This is why I made the smokeping_prober work the way it does. It sends a regular stream, but captures the data in a smarter way, as a histogram.

From the histogram data you can only collect the metrics every minute, and generate the same "min / max / avg / dev / loss" values that you're looking for. But the actual values are much more statistically valid, as it's measuring evenly over time.
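
In PromQL terms, loss can be derived from the counters, something like this (hedging on the exact metric names; `smokeping_requests_total` is from memory, check the prober's /metrics output):

```
# fraction of packets lost over the last 5 minutes
1 - (
    rate(smokeping_response_duration_seconds_count[5m])
  /
    rate(smokeping_requests_total[5m])
)
```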

Marcelo Magallón

Feb 21, 2021, 9:48:59 PM
to Ben Kochie, Stuart Clark, Prometheus Developers
Thanks Ben.

On Thu, Feb 18, 2021 at 1:37 PM Ben Kochie <sup...@gmail.com> wrote:
> The problem with what you're proposing is you're getting an invalid picture of data over time. This is the problem with the original smokeping program that the smokeping prober is trying to solve.
>
> The original smokeping software does exactly what you're talking about. It sends out a burst of 10 packets at the configured interval (in your example, 1 minute). The problem is this does not give you a real picture, because the packets are not evenly spaced.
>
> This is why I made the smokeping_prober work the way it does. It sends a regular stream, but captures the data in a smarter way, as a histogram.
>
> From the histogram data you can only collect the metrics every minute, and generate the same "min / max / avg / dev / loss" values that you're looking for. But the actual values are much more statistically valid, as it's measuring evenly over time.

That's fair. I do understand the argument for preferring continuous observations.

The problem I have with the histogram approach (and this is partly due to the current way histograms work in Prometheus) is that I don't know the distribution a priori.

I let smokeping_prober run for a few days against several IP addresses. For a particular one, after 250+ thousand observations, it's telling me that the round trip time is somewhere between 51.2 ms and 102.4 ms. Using the sum and the count from histogram data I can derive an average (not mean) over a short window and it's giving me ~ 60 ms. I happen to know (from the individual observations) that the 95th percentile is also ~ 60 ms, and that's pretty much the 50th percentile (the spread of the observations is very small). The actual min/max/avg from observations is something like 59.1 / 59.7 / 59.4 ms. If I use the data from the histogram the 50th percentile comes out as ~ 77 ms and the 95th percentile as ~ 100 ms. I must be missing something, because I don't see how I would extract the min / max / dev from the available data. I do understand that the standard deviation for this data is unusually small (compared to what you'd expect to see in the wild), but still...
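
Incidentally, those ~ 77 ms / ~ 100 ms numbers are exactly what linear interpolation inside one wide bucket produces. Here is a simplified Python sketch of what histogram_quantile() does, assuming every observation landed in the (51.2 ms, 102.4 ms] bucket:

```python
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound_seconds, cumulative_count).
    Interpolates linearly inside the bucket that contains the q-th
    observation, mirroring PromQL's histogram_quantile() (simplified)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# All 250k observations land in the (0.0512, 0.1024] bucket:
b = [(0.0512, 0), (0.1024, 250_000)]
print(round(histogram_quantile(0.50, b) * 1000, 1))  # 76.8 (ms)
print(round(histogram_quantile(0.95, b) * 1000, 1))  # 99.8 (ms)
```

So even though the real spread is 59.1-59.7 ms, the histogram can only answer "somewhere in this bucket", and interpolation places the quantiles accordingly.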

I also have to think of the data size. For 1 ICMP packet every 1 second, I'm at (order of magnitude) 100 MB of data per target per month. Reducing this to 5 packets every 60 seconds I'm down to 10 MB (order of magnitude). This doesn't sound like much for a single target but it does add up.
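
The back-of-the-envelope arithmetic behind those numbers (assuming ~ 64 bytes per echo request and counting one direction only; exact sizes depend on payload and headers):

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 2,592,000
BYTES_PER_PACKET = 64                  # typical ICMP echo size, one way

continuous_bytes = 1 * SECONDS_PER_MONTH * BYTES_PER_PACKET        # 1 packet/s
burst_bytes = 5 * SECONDS_PER_MONTH * BYTES_PER_PACKET // 60       # 5 packets per 60 s

print(f"{continuous_bytes / 1e6:.0f} MB/month")  # ~166 MB: order of magnitude 100 MB
print(f"{burst_bytes / 1e6:.0f} MB/month")       # ~14 MB: order of magnitude 10 MB
```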

As a side note, I noticed that smokeping_prober resolves the IP address once. With BBE this happens every time the probe runs, so I don't have to do anything if I'm monitoring a host whose IP address might change every now and then.

Thanks again,

Marcelo

Ben Kochie

Feb 22, 2021, 6:40:31 AM
to Marcelo Magallón, Stuart Clark, Prometheus Developers
On Mon, Feb 22, 2021 at 3:48 AM Marcelo Magallón <marcelo....@grafana.com> wrote:
> Thanks Ben.
>
> On Thu, Feb 18, 2021 at 1:37 PM Ben Kochie <sup...@gmail.com> wrote:
>> The problem with what you're proposing is you're getting an invalid picture of data over time. This is the problem with the original smokeping program that the smokeping prober is trying to solve.
>>
>> The original smokeping software does exactly what you're talking about. It sends out a burst of 10 packets at the configured interval (in your example, 1 minute). The problem is this does not give you a real picture, because the packets are not evenly spaced.
>>
>> This is why I made the smokeping_prober work the way it does. It sends a regular stream, but captures the data in a smarter way, as a histogram.
>>
>> From the histogram data you can only collect the metrics every minute, and generate the same "min / max / avg / dev / loss" values that you're looking for. But the actual values are much more statistically valid, as it's measuring evenly over time.
>
> That's fair. I do understand the argument for preferring continuous observations.
>
> The problem I have with the histogram approach (and this is partly due to the current way histograms work in Prometheus) is that I don't know the distribution a priori.
>
> I let smokeping_prober run for a few days against several IP addresses. For a particular one, after 250+ thousand observations, it's telling me that the round trip time is somewhere between 51.2 ms and 102.4 ms. Using the sum and the count from histogram data I can derive an average (not mean) over a short window and it's giving me ~ 60 ms. I happen to know (from the individual observations) that the 95th percentile is also ~ 60 ms, and that's pretty much the 50th percentile (the spread of the observations is very small). The actual min/max/avg from observations is something like 59.1 / 59.7 / 59.4 ms. If I use the data from the histogram the 50th percentile comes out as ~ 77 ms and the 95th percentile as ~ 100 ms. I must be missing something, because I don't see how I would extract the min / max / dev from the available data. I do understand that the standard deviation for this data is unusually small (compared to what you'd expect to see in the wild), but still...

The default histogram buckets in the smokeping_prober cover latency durations from localhost to the moon and back. It's relatively easy to adjust the buckets, and easy enough to get within a reasonable range for your network expectations.
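
For example, something like this (plain Python mirroring the Prometheus Go client's ExponentialBuckets / LinearBuckets helpers) shows the difference between doubling buckets and buckets tuned around an expected ~ 60 ms RTT:

```python
def exponential_buckets(start, factor, count):
    """Upper bounds start, start*factor, ... (like prometheus.ExponentialBuckets)."""
    return [start * factor ** i for i in range(count)]

def linear_buckets(start, width, count):
    """Evenly spaced upper bounds (like prometheus.LinearBuckets)."""
    return [start + width * i for i in range(count)]

# Doubling buckets: 51.2 ms and 102.4 ms are adjacent bounds, so one
# bucket swallows the whole 51-102 ms range.
print(exponential_buckets(0.0001, 2, 12))
# Buckets tuned for a path whose RTT sits near 60 ms: 1 ms resolution.
print(linear_buckets(0.055, 0.001, 10))
```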

Without knowing exactly what queries you're running, it's hard to say what you're doing. If you're using the histogram count/sum, this will give you the mean value.

There is one known issue with the smokeping_prober right now that I need to fix, the ping library handling of sequence numbers is broken and doesn't wrap correctly.

> I also have to think of the data size. For 1 ICMP packet every 1 second, I'm at (order of magnitude) 100 MB of data per target per month. Reducing this to 5 packets every 60 seconds I'm down to 10 MB (order of magnitude). This doesn't sound like much for a single target but it does add up.

Yes, this is going to be an issue no matter what you do. I don't see how this relates to any mode of operation.

> As a side note, I noticed that smokeping_prober resolves the IP address once. With BBE this happens every time the probe runs, so I don't have to do anything if I'm monitoring a host where IP addresses might change every now and then.

Yes, this is currently intentional, but re-resolving is something I'm planning to do eventually.

> Thanks again,
>
> Marcelo

Marcelo Magallón

Feb 23, 2021, 6:05:00 PM
to Ben Kochie, Stuart Clark, Prometheus Developers
On Mon, Feb 22, 2021 at 5:40 AM Ben Kochie <sup...@gmail.com> wrote:
>> The problem I have with the histogram approach (and this is partly due to the current way histograms work in Prometheus) is that I don't know the distribution a priori.
>>
>> I let smokeping_prober run for a few days against several IP addresses. For a particular one, after 250+ thousand observations, it's telling me that the round trip time is somewhere between 51.2 ms and 102.4 ms. Using the sum and the count from histogram data I can derive an average (not mean) over a short window and it's giving me ~ 60 ms. I happen to know (from the individual observations) that the 95th percentile is also ~ 60 ms, and that's pretty much the 50th percentile (the spread of the observations is very small). The actual min/max/avg from observations is something like 59.1 / 59.7 / 59.4 ms. If I use the data from the histogram the 50th percentile comes out as ~ 77 ms and the 95th percentile as ~ 100 ms. I must be missing something, because I don't see how I would extract the min / max / dev from the available data. I do understand that the standard deviation for this data is unusually small (compared to what you'd expect to see in the wild), but still...
>
> The default histogram buckets in the smokeping_prober cover latency durations from localhost to the moon and back. It's relatively easy to adjust the buckets, and easy enough to get within a reasonable range for your network expectations.
>
> Without knowing exactly what queries you're running, it's hard to say what you're doing. If you're using the histogram count/sum, this will give you the mean value.

histogram_quantile(0.95, rate(smokeping_response_duration_seconds_bucket[1m]))
histogram_quantile(0.50, rate(smokeping_response_duration_seconds_bucket[1m]))
histogram_quantile(0.05, rate(smokeping_response_duration_seconds_bucket[1m]))
increase(smokeping_response_duration_seconds_sum[1m])/increase(smokeping_response_duration_seconds_count[1m])

And yes, I'm using the default buckets, but that's what I said before: I don't know the distribution a priori. Ideally I would generate buckets centered around the expected mean, but that mean is wildly different depending on the target IP address. So I'm left with the problem of having to define either too many buckets or buckets that are too wide to provide good estimates for the above quantities, when my original problem was in principle trying to provide a reasonable guesstimate for packet loss and variance...


> There is one known issue with the smokeping_prober right now that I need to fix, the ping library handling of sequence numbers is broken and doesn't wrap correctly.

>> I also have to think of the data size. For 1 ICMP packet every 1 second, I'm at (order of magnitude) 100 MB of data per target per month. Reducing this to 5 packets every 60 seconds I'm down to 10 MB (order of magnitude). This doesn't sound like much for a single target but it does add up.

> Yes, this is going to be an issue no matter what you do. I don't see how this relates to any mode of operation.

I'm sorry I wasn't clear enough...

With the way smokeping_prober works, I can send one packet per second and that produces ~ 100 MB / target / month in traffic.

With what I wrote initially, one burst of 5 packets every 60 seconds, I'm down to 10 MB / target / month.

I could run smokeping_prober with a ping interval of 12 seconds, and I would get the same 10 MB / target / month, but then I go back to my original question: what do I gain by doing this vs adding functionality to blackbox_exporter to send multiple packets per probe?

Thanks again,

Marcelo