Collecting round trip time and packet loss

Michael Stapelberg

Aug 27, 2016, 2:19:20 PM
to Prometheus Developers
Hey,

Recently, I’ve had a network partition which only affected the link between 2 specific nodes out of our 3 total nodes. It took me a while to diagnose that situation properly, and I’m wondering whether I could make such a diagnosis easier in the future.

In particular, I’m thinking it would be good to monitor the round-trip time (RTT) and packet loss from each node to all other nodes.

While the blackbox_exporter seems to support an ICMP-based probe¹, it doesn’t seem to export metrics about the RTT and packet loss. In fact, I couldn’t find any available exporter which would export these two metrics.

Is there anything I’m missing? Would the blackbox_exporter be the right place to contribute such a feature? If not, where should it go?

Thanks!

¹ I think the ICMP probe is not granular enough for the specific failure mode I saw: some pings did go through in general, but there was enough packet loss that TCP connections did not work reliably enough for my application to function.

Richard Hartmann

Aug 27, 2016, 4:49:40 PM
to Michael Stapelberg, Prometheus Developers
On Sat, Aug 27, 2016 at 8:19 PM, Michael Stapelberg
<mic...@robustirc.net> wrote:

> In particular, I’m thinking it would be good to monitor the round-trip time
> (RTT) and packet loss from each node to all other nodes.

Just keep in mind what growth characteristics a double fully connected,
directed graph has.
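
To put a rough number on that: if every node probes every other node, a cluster of n nodes has n × (n − 1) directed links to monitor. A minimal Python sketch (the function name is made up for illustration):

```python
def directed_probe_pairs(node_count: int) -> int:
    """Number of ordered (source, target) pairs when every node probes every other node."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return node_count * (node_count - 1)


for n in (3, 5, 50):
    print(n, "nodes ->", directed_probe_pairs(n), "probe pairs")
```

The count grows quadratically, so it stays small for a handful of nodes but adds up quickly for larger fleets.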


> While the blackbox_exporter seems to support an ICMP-based probe¹, it
> doesn’t seem to export metrics about the RTT and packet loss. In fact, I
> couldn’t find any available exporter which would export these two metrics.
>
> Is there anything I’m missing? Would the blackbox_exporter be the right
> place to contribute such a feature? If not, where should it go?

RTT can be found in scrape_duration_seconds.
Loss is probe_success, but that might not be what you want.
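
As a rough illustration of why probe_success is only an approximation of loss: a loss estimate can be taken as one minus the mean of the probe_success samples in a window (in PromQL, something along the lines of `1 - avg_over_time(probe_success[5m])`). This counts failed probes, not individual lost packets. A small Python sketch with made-up sample values:

```python
def loss_ratio(probe_success_samples):
    """Fraction of failed probes in a window of 0/1 probe_success samples."""
    samples = list(probe_success_samples)
    if not samples:
        raise ValueError("need at least one sample")
    return 1.0 - sum(samples) / len(samples)


# Made-up window: 10 scrapes, 3 of which failed.
window = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
print(f"estimated loss: {loss_ratio(window):.0%}")
```

The resolution of such an estimate is limited by the scrape interval, since each scrape contributes a single 0 or 1.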

From context, I suspect that you expect something along the lines of
traceroute/mtr. Problem with that is that you need time for this,
which clashes with the way exporters work to some extent. If you do
want to do this, look at the pushgateway.

Or code up your own; we still need a C client library, anyway ;)


Richard

Ben Kochie

Aug 27, 2016, 5:15:01 PM
to Michael Stapelberg, Prometheus Developers

It would be interesting to get metrics out of something like smokeping. That would allow for nice latency histograms.


--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Richard Hartmann

Aug 27, 2016, 5:20:38 PM
to Ben Kochie, Michael Stapelberg, Prometheus Developers
On Sat, Aug 27, 2016 at 11:14 PM, Ben Kochie <sup...@gmail.com> wrote:
> It would be interesting to get metrics out of something like smokeping. That
> would allow for nice latency histograms.

Having replaced smokeping with Prometheus at work, I would prefer to
keep this in Prometheus proper.


Richard

Michael Stapelberg

Aug 29, 2016, 3:59:21 AM
to Richard Hartmann, Prometheus Developers
On Sat, Aug 27, 2016 at 10:49 PM, Richard Hartmann <richih.ma...@gmail.com> wrote:
> On Sat, Aug 27, 2016 at 8:19 PM, Michael Stapelberg
> <mic...@robustirc.net> wrote:
>
> > In particular, I’m thinking it would be good to monitor the round-trip time
> > (RTT) and packet loss from each node to all other nodes.
>
> Just keep in mind what growth characteristics a double fully connected,
> directed graph has.

Good point :). The application in question is always run with either 3 nodes or 5 nodes, so growth is not an issue.
 


> > While the blackbox_exporter seems to support an ICMP-based probe¹, it
> > doesn’t seem to export metrics about the RTT and packet loss. In fact, I
> > couldn’t find any available exporter which would export these two metrics.
> >
> > Is there anything I’m missing? Would the blackbox_exporter be the right
> > place to contribute such a feature? If not, where should it go?
>
> RTT can be found in scrape_duration_seconds.
> Loss is probe_success, but that might not be what you want.

Thanks for the probe_duration_seconds hint! Unfortunately, the contents of that metric don’t match what I expect. I configured the following module in my blackbox-exporter:

  ping_1s:
    prober: icmp
    timeout: 1s

Then, I configured the following job in my prometheus config:

- job_name: blackbox_ping_vultr
  scrape_interval: 1s
  metrics_path: /probe
  params:
    module: [ping_1s]
    target: ['vultr.robustirc.net']
  scheme: http
  static_configs:
  - targets:
    - blackbox-exporter:9115

When I inspect the probe_duration_seconds metric, I see:

probe_duration_seconds{instance="blackbox-exporter:9115",job="blackbox_ping_vultr"} 0.026064 @1472457056.356
0.02501 @1472457057.356
0.025261 @1472457058.356
0.009373 @1472457059.356
0.009264 @1472457060.356
0.025502 @1472457061.356
0.010381 @1472457062.356
0.009539 @1472457063.356
0.009445 @1472457064.356
0.009403 @1472457065.356
0.009489 @1472457066.356
0.009273 @1472457067.356
0.025229 @1472457068.356
0.033352 @1472457069.356

Compare that to the results which ping/ping6 print:

PING vultr.robustirc.net (45.32.156.109) 56(84) bytes of data.
64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=1 ttl=56 time=6.94 ms
64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=2 ttl=56 time=6.92 ms
64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=3 ttl=56 time=6.89 ms
64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=4 ttl=56 time=6.92 ms
64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=5 ttl=56 time=6.95 ms
64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=6 ttl=56 time=6.92 ms
64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=7 ttl=56 time=6.99 ms
64 bytes from vultr.robustirc.net (45.32.156.109): icmp_seq=8 ttl=56 time=7.00 ms
[…]

64 bytes from vultr.robustirc.net: icmp_seq=1 ttl=56 time=7.02 ms
64 bytes from vultr.robustirc.net: icmp_seq=2 ttl=56 time=6.98 ms
64 bytes from vultr.robustirc.net: icmp_seq=3 ttl=56 time=6.99 ms
64 bytes from vultr.robustirc.net: icmp_seq=4 ttl=56 time=6.99 ms
[…]

So, I would have expected probe_duration_seconds values around 0.00702, but blackbox-exporter’s probe_duration_seconds starts at 0.009 and has massive spikes.

Am I doing something wrong?
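
To quantify the discrepancy, the samples above can be summarized in a few lines of Python. The values are copied from this message; the ping baseline of roughly 7 ms is read off the ping output and is an approximation:

```python
import statistics

# probe_duration_seconds samples copied from the output above, in seconds.
probe_durations = [
    0.026064, 0.02501, 0.025261, 0.009373, 0.009264, 0.025502, 0.010381,
    0.009539, 0.009445, 0.009403, 0.009489, 0.009273, 0.025229, 0.033352,
]

# Approximate RTT reported by ping in the same message, in seconds.
ping_rtt = 0.007

fastest = min(probe_durations)
median = statistics.median(probe_durations)
slowest = max(probe_durations)

for label, value in (("min", fastest), ("median", median), ("max", slowest)):
    overhead_ms = (value - ping_rtt) * 1000
    print(f"{label}: {value * 1000:.2f} ms ({overhead_ms:+.2f} ms vs. ping)")
```

Even the fastest probe sits a couple of milliseconds above the ping RTT, and the slowest is several times larger.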

Richard Hartmann

Aug 29, 2016, 4:22:30 AM
to Michael Stapelberg, Prometheus Developers
On Mon, Aug 29, 2016 at 9:58 AM, Michael Stapelberg
<mic...@robustirc.net> wrote:
> So, I would have expected probe_duration_seconds values around 0.00702, but
> blackbox-exporter’s probe_duration_seconds starts at 0.009 and has massive
> spikes.

Good point. Brian should know by heart how he implemented it.

As an aside, you really want `mtr -rnc $number` for such stats on cli.


Richard

Brian Brazil

Aug 29, 2016, 4:27:17 AM
to Richard Hartmann, Michael Stapelberg, Prometheus Developers
On 29 August 2016 at 09:22, Richard Hartmann <richih.ma...@gmail.com> wrote:
> On Mon, Aug 29, 2016 at 9:58 AM, Michael Stapelberg
> <mic...@robustirc.net> wrote:
> > So, I would have expected probe_duration_seconds values around 0.00702, but
> > blackbox-exporter’s probe_duration_seconds starts at 0.009 and has massive
> > spikes.
>
> Good point. Brian should know by heart how he implemented it.

It's a pretty simple implementation. What does strace show in terms of timing?

Brian
 

> As an aside, you really want `mtr -rnc $number` for such stats on cli.
>
> Richard


Michael Stapelberg

Aug 30, 2016, 2:14:01 AM
to Brian Brazil, Richard Hartmann, Prometheus Developers
Based on strace, I think the additional latency/variance was caused by DNS lookups. When using the IP address directly, the probe duration hovers at 7ms as expected.
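
One way to check this kind of diagnosis without strace is to time name resolution separately from the probe itself. A minimal Python sketch, assuming the caller supplies its own probe function; `timed_phases` and the `probe` callable are hypothetical names and not part of blackbox_exporter:

```python
import socket
import time


def timed_phases(host, probe, resolve=None):
    """Time name resolution and the probe separately.

    `probe` is a caller-supplied function taking an IP address string.
    `resolve` maps a hostname to an IP address string; it defaults to a
    lookup via socket.getaddrinfo.
    Returns (resolve_seconds, probe_seconds).
    """
    if resolve is None:
        def resolve(name):
            # getaddrinfo returns (family, type, proto, canonname, sockaddr)
            # tuples; sockaddr[0] is the address string.
            return socket.getaddrinfo(name, None)[0][4][0]

    start = time.perf_counter()
    address = resolve(host)
    resolved = time.perf_counter()
    probe(address)
    done = time.perf_counter()
    return resolved - start, done - resolved
```

Calling `timed_phases("vultr.robustirc.net", my_probe)` with a real ICMP or TCP probe would show how much of the total is spent in the lookup; if the first number dominates, probing the IP address directly avoids it.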

Thanks everyone!

--
You received this message because you are subscribed to the Google Groups "Prometheus Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-developers+unsubscri...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--

Richard Hartmann

Aug 30, 2016, 2:32:36 AM
to Michael Stapelberg, Prometheus Developers, Brian Brazil

That raises an important point, though. Wouldn't most people expect not to measure this overhead?
Splitting this into two metrics might make sense.

Richard

Sent by mobile; excuse my brevity.

Michael Stapelberg

Aug 30, 2016, 2:59:12 AM
to Richard Hartmann, Prometheus Developers, Brian Brazil