Google SRE book "The Four Golden Signals" with blackbox_exporter

Evelyn Pereira Souza

Nov 5, 2020, 9:36:31 AM
to Prometheus Users
Hi

https://landing.google.com/sre/sre-book/chapters/monitoring-distributed-systems/

scroll to "The Four Golden Signals"

Latency
-> my solution: avg_over_time(probe_duration_seconds{yyy}[15m]) > zz

Traffic
-> not possible with blackbox

Errors
-> my solution: avg_over_time(probe_success{yyy}[20m]) * 100 < zz

Saturation
-> not possible with blackbox

Is this correct? 2 of 4 golden signals can't be done with blackbox? Any
feedback/improvements to my Latency and Errors queries?

regards
Evelyn

Brian Candler

Nov 5, 2020, 12:47:04 PM
to Prometheus Users
On Thursday, 5 November 2020 14:36:31 UTC, Evelyn Pereira Souza wrote:
> Is this correct? 2 of 4 golden signals can't be done with blackbox?

I'd say not even that.

Suppose you run blackbox exporter with a scrape interval of 5 seconds: you'll only get 12 measurements per minute.  However, your real application may be handling thousands of requests per second, and these are *real* user requests performing real work.

What you care about is the proportion of those real user requests which give errors, and the latency of those real user requests.  These measurements are *much* more useful than synthetic measurements from blackbox exporter.
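As a rough sketch of what those two measurements could look like in PromQL, assuming the application exports a counter `http_requests_total` with a `status` label and a histogram `http_request_duration_seconds` (both names are hypothetical; substitute whatever your application actually exposes):

```promql
# Errors: fraction of real requests answered with a 5xx status over the last 5 minutes.
# http_requests_total and its "status" label are assumed names, not from the thread.
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency: estimated 95th percentile request duration over the last 5 minutes.
# http_request_duration_seconds_bucket is an assumed histogram name.
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
```

Both expressions are computed over every request the application served, rather than over the handful of synthetic probes per minute that blackbox exporter provides.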

Once you've got those real error and latency measurements, then the "traffic" measurement usually falls out in some application-specific way.  e.g. for a web server it could be total number of HTTP requests.
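For the web server case, a minimal sketch of a traffic query, again assuming a hypothetical request counter named `http_requests_total`:

```promql
# Traffic: requests per second across all instances, averaged over 5 minutes.
# http_requests_total is an assumed metric name.
sum(rate(http_requests_total[5m]))
```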

Resource saturation usually depends on some way of measuring the underlying resources, e.g. node_exporter for server resources (CPU, RAM, disk IOPS etc.).
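A few illustrative saturation queries, assuming node_exporter is running on a Linux host with its default collectors enabled (the 5-minute windows are arbitrary choices, not a recommendation from the thread):

```promql
# CPU: fraction of CPU time not spent idle, per instance.
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# RAM: fraction of memory in use, per instance.
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

# Disk: fraction of time each device was busy with I/O.
rate(node_disk_io_time_seconds_total[5m])
```

Values approaching 1 suggest the resource is close to saturated; the thresholds worth alerting on depend on the workload.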

blackbox_exporter is mainly useful for testing systems that you don't control, or which don't produce their own logs or metrics - a DNS server is a typical example, or a network link - or for alerting you quickly when a service has gone down completely.
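For the "gone down completely" case, a sketch of an alert expression, assuming the probes are scraped under a job labelled `blackbox` (the job name and the 2-minute window are assumptions):

```promql
# Fires when every probe in the last 2 minutes failed.
# probe_success is 1 for a successful probe and 0 for a failed one.
max_over_time(probe_success{job="blackbox"}[2m]) == 0
```

Using a window rather than a bare `probe_success == 0` avoids alerting on a single failed probe.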

Evelyn Pereira Souza

Nov 6, 2020, 5:45:38 AM
to promethe...@googlegroups.com
On 05.11.20 18:47, Brian Candler wrote:
> What you care about is the proportion of those real user requests which
> give errors, and the latency of those real user requests.  These
> measurements are *much* more useful than synthetic measurements from
> blackbox exporter.

Thank you for the explanation. I think you are right.
I think I need to parse the Tomcat logs.

regards
Evelyn

Brian Candler

Nov 6, 2020, 6:17:47 AM
to Prometheus Users
mtail and grok_exporter may help you.  Also look at Loki for storing your logs: you can generate Prometheus metrics from promtail, and recent features mean you can run periodic LogQL queries to turn them into Prometheus alerts.
