Collecting round trip time and packet loss

Michael Stapelberg

Aug 27, 2016, 2:19:20 PM

to Prometheus Developers

Hey,

Recently, I’ve had a network partition which only affected the link between 2 specific nodes out of our 3 total nodes. It took me a while to diagnose that situation properly, and I’m wondering whether I could make such a diagnosis easier in the future.

In particular, I’m thinking it would be good to monitor the round-trip time (RTT) and packet loss from each node to all other nodes.

While the blackbox_exporter seems to support an ICMP-based probe¹, it doesn’t seem to export metrics about the RTT and packet loss. In fact, I couldn’t find any available exporter which would export these two metrics.

Is there anything I’m missing? Would the blackbox_exporter be the right place to contribute such a feature? If not, where should it go?

Thanks!

¹ I think the ICMP probe is not granular enough for the specific failure mode I saw: some pings did go through in general, but there was enough packet loss that TCP connections did not work reliably enough for my application to function.

Richard Hartmann

Aug 27, 2016, 4:49:40 PM

to Michael Stapelberg, Prometheus Developers

On Sat, Aug 27, 2016 at 8:19 PM, Michael Stapelberg
<mic...@robustirc.net> wrote:

> In particular, I’m thinking it would be good to monitor the round-trip time
> (RTT) and packet loss from each node to all other nodes.

Just keep in mind what growth characteristics a double fully connected,
directed graph has.

> While the blackbox_exporter seems to support an ICMP-based probe¹, it
> doesn’t seem to export metrics about the RTT and packet loss. In fact, I
> couldn’t find any available exporter which would export these two metrics.
>
> Is there anything I’m missing? Would the blackbox_exporter be the right
> place to contribute such a feature? If not, where should it go?

RTT can be found in scrape_duration_seconds.
Loss is probe_success, but that might not be what you want.
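
To make the probe_success idea concrete: each probe yields 1 (success) or 0 (failure), so a loss ratio over a window is one minus the mean of the samples (in PromQL, roughly `1 - avg_over_time(probe_success[5m])`). The following is only a minimal Python sketch of that arithmetic; the sample window is made up.

```python
def loss_ratio(samples):
    """Fraction of failed probes in a window of probe_success samples.

    Each sample is 1 (probe succeeded) or 0 (probe failed).
    """
    if not samples:
        raise ValueError("need at least one sample")
    return 1 - sum(samples) / len(samples)


# Hypothetical window of ten probes, two of which failed.
window = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
print(f"{loss_ratio(window):.0%} loss")
```

With one probe per scrape, the resolution of this ratio is bounded by the scrape interval and window length, which is one reason it might not be what you want.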

From context, I suspect that you expect something along the lines of
traceroute/mtr. Problem with that is that you need time for this,
which clashes with the way exporters work to some extent. If you do
want to do this, look at the pushgateway.

Or code up your own; we still need a C client library, anyway ;)

Richard

Ben Kochie

Aug 27, 2016, 5:15:01 PM

to Michael Stapelberg, Prometheus Developers

It would be interesting to get metrics out of something like smokeping. That would allow for nice latency histograms.
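
As a rough illustration of what a latency histogram would involve: Prometheus histograms count observations into cumulative buckets, each labelled with an upper bound. The sketch below shows that bucketing in plain Python; the bucket bounds and the RTT samples are assumptions chosen for illustration, not values from smokeping or any exporter.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds; choose bounds that fit your network.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1]


def cumulative_buckets(rtts, bounds=BOUNDS):
    """Return {upper_bound: count of observations <= upper_bound}.

    A final +Inf bucket holds the total number of observations.
    """
    bounds = list(bounds)
    counts = [0] * (len(bounds) + 1)
    for rtt in rtts:
        # bisect_left finds the first bound that is >= rtt.
        counts[bisect_left(bounds, rtt)] += 1
    result = {}
    running = 0
    for bound, count in zip(bounds + [float("inf")], counts):
        running += count
        result[bound] = running
    return result


# Made-up RTT observations in seconds.
samples = [0.0069, 0.0071, 0.0093, 0.0255, 0.0334]
for bound, count in cumulative_buckets(samples).items():
    print(f"le={bound}: {count}")
```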

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Richard Hartmann

Aug 27, 2016, 5:20:38 PM

to Ben Kochie, Michael Stapelberg, Prometheus Developers

On Sat, Aug 27, 2016 at 11:14 PM, Ben Kochie <sup...@gmail.com> wrote:
> It would be interesting to get metrics out of something like smokeping. That
> would allow for nice latency histograms.

Having replaced smokeping with Prometheus at work, I would prefer to
keep this in Prometheus proper.

Richard

Michael Stapelberg

Aug 29, 2016, 3:59:21 AM

to Richard Hartmann, Prometheus Developers

On Sat, Aug 27, 2016 at 10:49 PM, Richard Hartmann <richih.ma...@gmail.com> wrote:

> On Sat, Aug 27, 2016 at 8:19 PM, Michael Stapelberg
> <mic...@robustirc.net> wrote:
>
> > In particular, I’m thinking it would be good to monitor the round-trip time
> > (RTT) and packet loss from each node to all other nodes.
>
> Just keep in mind what growth characteristics a double fully connected,
> directed graph has.

Good point :). The application in question is always run with either 3 nodes or 5 nodes, so growth is not an issue.
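
For reference, a full mesh of n nodes needs n × (n − 1) directed probes, i.e. 6 for 3 nodes and 20 for 5. A small Python sketch that enumerates the pairs; the node names are placeholders, not real hosts.

```python
from itertools import permutations


def probe_pairs(nodes):
    """All ordered (source, target) pairs: n * (n - 1) pairs for n nodes."""
    return list(permutations(nodes, 2))


for n in (3, 5):
    nodes = [f"node{i}" for i in range(1, n + 1)]  # hypothetical node names
    print(n, "nodes ->", len(probe_pairs(nodes)), "directed probes")
```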

> > While the blackbox_exporter seems to support an ICMP-based probe¹, it
> > doesn’t seem to export metrics about the RTT and packet loss. In fact, I
> > couldn’t find any available exporter which would export these two metrics.
> >
> > Is there anything I’m missing? Would the blackbox_exporter be the right
> > place to contribute such a feature? If not, where should it go?
>
> RTT can be found in scrape_duration_seconds.
> Loss is probe_success, but that might not be what you want.

Thanks for the probe_duration_seconds hint! Unfortunately, the contents of that metric don’t match what I expect. I configured the following module in my blackbox-exporter:

    ping_1s:
      prober: icmp
      timeout: 1s

Then, I configured the following job in my prometheus config:

    - job_name: blackbox_ping_vultr
      scrape_interval: 1s
      metrics_path: /probe
      params:
        module: [ping_1s]
        target: ['vultr.robustirc.net']
      scheme: http
      static_configs:
      - targets:
        - blackbox-exporter:9115

When I inspect the probe_duration_seconds metric, I see:

    probe_duration_seconds{instance="blackbox-exporter:9115",job="blackbox_ping_vultr"}
        0.026064 @1472457056.356
        0.02501 @1472457057.356
        0.025261 @1472457058.356
        0.009373 @1472457059.356
        0.009264 @1472457060.356
        0.025502 @1472457061.356
        0.010381 @1472457062.356
        0.009539 @1472457063.356
        0.009445 @1472457064.356
        0.009403 @1472457065.356
        0.009489 @1472457066.356
        0.009273 @1472457067.356
        0.025229 @1472457068.356
        0.033352 @1472457069.356

Compare that to the results which ping/ping6 print:

    $ ping vultr.robustirc.net
    PING vultr.robustirc.net (45.32.156.109) 56(84) bytes of data.
    64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=1 ttl=56 time=6.94 ms
    64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=2 ttl=56 time=6.92 ms
    64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=3 ttl=56 time=6.89 ms
    64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=4 ttl=56 time=6.92 ms
    64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=5 ttl=56 time=6.95 ms
    64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=6 ttl=56 time=6.92 ms
    64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=7 ttl=56 time=6.99 ms
    64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=8 ttl=56 time=7.00 ms
    […]

    $ ping6 vultr.robustirc.net
    PING vultr.robustirc.net(vultr.robustirc.net) 56 data bytes
    64 bytes from vultr.robustirc.net: icmp_seq=1 ttl=56 time=7.02 ms
    64 bytes from vultr.robustirc.net: icmp_seq=2 ttl=56 time=6.98 ms
    64 bytes from vultr.robustirc.net: icmp_seq=3 ttl=56 time=6.99 ms
    64 bytes from vultr.robustirc.net: icmp_seq=4 ttl=56 time=6.99 ms
    […]

So, I would have expected probe_duration_seconds values around 0.00702, but blackbox-exporter’s probe_duration_seconds starts at 0.009 and has massive spikes.
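
One way to quantify the gap, using only the numbers quoted above. This is a rough Python sketch; treating anything above twice the ping mean as a "spike" is an arbitrary assumption made for illustration.

```python
from statistics import mean, median

# probe_duration_seconds samples quoted above, in seconds.
probe = [
    0.026064, 0.02501, 0.025261, 0.009373, 0.009264, 0.025502, 0.010381,
    0.009539, 0.009445, 0.009403, 0.009489, 0.009273, 0.025229, 0.033352,
]

# IPv4 ping times quoted above, converted from milliseconds to seconds.
ping = [ms / 1000 for ms in (6.94, 6.92, 6.89, 6.92, 6.95, 6.92, 6.99, 7.00)]

baseline = mean(ping)
# Arbitrary threshold: call a sample a spike if it exceeds twice the ping mean.
spikes = [v for v in probe if v > 2 * baseline]

print(f"ping mean:     {baseline * 1000:.2f} ms")
print(f"probe minimum: {min(probe) * 1000:.2f} ms")
print(f"probe median:  {median(probe) * 1000:.2f} ms")
print(f"probe maximum: {max(probe) * 1000:.2f} ms")
print(f"spikes:        {len(spikes)} of {len(probe)} samples")
```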

Am I doing something wrong?

Richard Hartmann

Aug 29, 2016, 4:22:30 AM

to Michael Stapelberg, Prometheus Developers

On Mon, Aug 29, 2016 at 9:58 AM, Michael Stapelberg
<mic...@robustirc.net> wrote:
> So, I would have expected probe_duration_seconds values around 0.00702, but
> blackbox-exporter’s probe_duration_seconds starts at 0.009 and has massive
> spikes.

Good point. Brian should know by heart how he implemented it.

As an aside, you really want `mtr -rnc $number` for such stats on cli.

Richard

Brian Brazil

Aug 29, 2016, 4:27:17 AM

to Richard Hartmann, Michael Stapelberg, Prometheus Developers

On 29 August 2016 at 09:22, Richard Hartmann <richih.ma...@gmail.com> wrote:

> On Mon, Aug 29, 2016 at 9:58 AM, Michael Stapelberg
> <mic...@robustirc.net> wrote:
> > So, I would have expected probe_duration_seconds values around 0.00702, but
> > blackbox-exporter’s probe_duration_seconds starts at 0.009 and has massive
> > spikes.
>
> Good point. Brian should know by heart how he implemented it.

It's a pretty simple implementation. What does strace show in terms of timing?

Brian

> As an aside, you really want `mtr -rnc $number` for such stats on cli.
>
> Richard

--
Brian Brazil
www.robustperception.io

Michael Stapelberg

Aug 30, 2016, 2:14:01 AM

to Brian Brazil, Richard Hartmann, Prometheus Developers

Based on strace, I think the additional latency/variance was caused by DNS lookups. When using the IP address directly, the probe duration hovers at 7ms as expected.

Thanks everyone!

Richard Hartmann

Aug 30, 2016, 2:32:36 AM

to Michael Stapelberg, Prometheus Developers, Brian Brazil

That raises an important point, though. Wouldn't most people expect not to measure this overhead?
Splitting this into two metrics might make sense.
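
A rough Python sketch of what such a split could look like: time the DNS lookup and the probe itself separately and report both. This is not how blackbox_exporter is implemented; the key names `dns_lookup_seconds` and `probe_seconds` are hypothetical, and the `probe` callable merely stands in for the real ICMP exchange, which needs raw-socket privileges.

```python
import socket
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def resolve(host):
    """Resolve host to the first address the system resolver returns."""
    return socket.getaddrinfo(host, None)[0][4][0]


def probe_with_split_timing(host, probe, resolver=resolve):
    """Time name resolution and the probe separately.

    `probe` is any callable taking an IP address string. The returned
    key names are hypothetical, chosen only for this sketch.
    """
    addr, dns_seconds = timed(resolver, host)
    _, probe_seconds = timed(probe, addr)
    return {
        "address": addr,
        "dns_lookup_seconds": dns_seconds,
        "probe_seconds": probe_seconds,
    }
```

Calling `probe_with_split_timing(host, probe)` with a real probe function would then yield two durations, so that a slow resolver no longer shows up as apparent network latency.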

Richard

Sent by mobile; excuse my brevity.

Michael Stapelberg

Aug 30, 2016, 2:59:12 AM

to Richard Hartmann, Prometheus Developers, Brian Brazil

Agreed. Filed https://github.com/prometheus/blackbox_exporter/issues/60
