HTTP phases explanation wanted

Houston Heat

Jan 25, 2022, 6:07:58 AM
to Prometheus Users

When probing via the Blackbox exporter, several HTTP phases (defined in the exporter's source on GitHub) show up as Prometheus metric labels:

  1. resolve
  2. connect
  3. tls
  4. processing
  5. transfer

What does "processing" stand for in the context of the exporter? What kind of processing exactly (waiting for JS/HTML/JSON rendering?), given that the time for receiving the response is accounted for in the "transfer" phase?

Context: We saw a sudden increase in latency and in availability failures on one of our sites when checking with the Blackbox exporter, but the total latency measured by request tracing and APM (New Relic) did not change. Additional geo-distributed blackboxes and synthetic monitoring tools confirm this. We are convinced that the behaviour of the site itself has not changed; however, there may be network/hardware/software problems in the environment of this particular exporter, and so we would like to understand what the "processing" phase reflects about the request.

It is quite obvious to me that if the site is unavailable, probe_duration_seconds will tend toward the timeout value; however, the significant difference in probe_http_duration_seconds{phase="processing"} makes us think not only about network problems, but also about the performance of the exporter itself, for example if there is a large number of concurrent checks.
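To rule out the exporter host itself, we could reproduce the measurement with a small Go program like the rough sketch below (the target URL and concurrency level are placeholders, not values from our setup). It times the interval from the connection being obtained to the first response byte, which roughly corresponds to the "processing" phase, once for a single request and then under parallel load:

// ttfb_check.go: rough cross-check of the "processing" phase from the probing
// host. It measures GotConn -> first response byte for one request, then for
// many parallel requests, to see whether local concurrency alone inflates it.
// The target URL and the concurrency level are placeholders.
package main

import (
	"fmt"
	"net/http"
	"net/http/httptrace"
	"sync"
	"time"
)

// measureTTFB returns the time from the connection being obtained (GotConn)
// to the first byte of the response (GotFirstResponseByte).
func measureTTFB(url string) (time.Duration, error) {
	var gotConn, firstByte time.Time
	trace := &httptrace.ClientTrace{
		GotConn:              func(httptrace.GotConnInfo) { gotConn = time.Now() },
		GotFirstResponseByte: func() { firstByte = time.Now() },
	}
	req, err := http.NewRequest("GET", url, nil)
	if err != nil {
		return 0, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := http.DefaultTransport.RoundTrip(req) // single request, no redirects
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return firstByte.Sub(gotConn), nil
}

func main() {
	const target = "https://example.com/" // placeholder: the probed site
	const concurrency = 50                // placeholder: number of overlapping checks

	// Baseline: a single probe.
	if d, err := measureTTFB(target); err == nil {
		fmt.Printf("single probe: %v\n", d)
	}

	// Many probes at once, similar to many overlapping blackbox checks.
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			if d, err := measureTTFB(target); err == nil {
				fmt.Printf("concurrent probe %d: %v\n", n, d)
			}
		}(i)
	}
	wg.Wait()
}

If the single-probe timing is fine but the concurrent timings blow up, that would point at the probing host rather than the site.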

[Attached screenshots: overall-phases-shift.png, phases-shift-without-processing.png, probe_http_duration-shift.png, probe-failures.png]

Single scrape example:
# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 1.6482566360000002
# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length -1
# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0.00674702
probe_http_duration_seconds{phase="processing"} 1.57984579
probe_http_duration_seconds{phase="resolve"} 0.008215475
probe_http_duration_seconds{phase="tls"} 0.024449581
probe_http_duration_seconds{phase="transfer"} 0.028283079
# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 1


Thank you.

Houston Heat

Jan 25, 2022, 6:47:53 AM
to Prometheus Users
I found an issue (https://github.com/prometheus/blackbox_exporter/issues/579) describing a similar shift in timing for all of the phases except "processing"; in that case the metric returns 0, because the phase never actually started at all.

On Tuesday, January 25, 2022 at 14:07:58 UTC+3, Houston Heat wrote:

Houston Heat

Jan 25, 2022, 6:51:07 AM
to Prometheus Users
Metric values in case of failure (5 sec timeout):
# HELP probe_dns_lookup_time_seconds Returns the time taken for probe dns lookup in seconds
# TYPE probe_dns_lookup_time_seconds gauge
probe_dns_lookup_time_seconds 0.003334816

# HELP probe_duration_seconds Returns how long the probe took to complete in seconds
# TYPE probe_duration_seconds gauge
probe_duration_seconds 5.000336144
# HELP probe_failed_due_to_regex Indicates if probe failed due to regex
# TYPE probe_failed_due_to_regex gauge
probe_failed_due_to_regex 0

# HELP probe_http_content_length Length of http content response
# TYPE probe_http_content_length gauge
probe_http_content_length 0

# HELP probe_http_duration_seconds Duration of http request by phase, summed over all redirects
# TYPE probe_http_duration_seconds gauge
probe_http_duration_seconds{phase="connect"} 0
probe_http_duration_seconds{phase="processing"} 0
probe_http_duration_seconds{phase="resolve"} 0.003334816
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0

# HELP probe_http_redirects The number of redirects
# TYPE probe_http_redirects gauge
probe_http_redirects 0
# HELP probe_http_ssl Indicates if SSL was used for the final redirect
# TYPE probe_http_ssl gauge
probe_http_ssl 0
# HELP probe_http_status_code Response HTTP status code
# TYPE probe_http_status_code gauge
probe_http_status_code 0
# HELP probe_http_uncompressed_body_length Length of uncompressed response body
# TYPE probe_http_uncompressed_body_length gauge
probe_http_uncompressed_body_length 0
# HELP probe_http_version Returns the version of HTTP of the probe response
# TYPE probe_http_version gauge
probe_http_version 0
# HELP probe_ip_protocol Specifies whether probe ip protocol is IP4 or IP6
# TYPE probe_ip_protocol gauge
probe_ip_protocol 4
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 0

On Tuesday, January 25, 2022 at 14:47:53 UTC+3, Houston Heat wrote:

Brian Candler

Jan 25, 2022, 9:45:28 AM
to Prometheus Users
The code for setting the "processing" metric is here:

"resolve": trace.start to dnsDone
"connect": dnsDone to gotConn (for non-TLS) (*)
"processing": gotConn to responseStart
"transfer": responseStart to trace.end

AFAICS, "processing" is the time between the connection being established (and presumably the request being written), until the beginning of the HTTP response being received.

> makes us think not only about network problems, but also about the performance of the exporter itself, for example if there is a large number of concurrent checks.

It could be, but it seems unlikely.  I'd say it's more likely to be slow response from the remote webserver that you're probing, which is what this metric is intended to measure.

It *could* be network packet loss causing delays and retransmits, but then I'd expect the "connect" and/or "tls" time to be high as well. Similarly, if the box where blackbox_exporter is running were heavily overloaded, then several metrics would likely be high.  If only "processing" is high, then more likely this is a real reflection of the remote webserver being slow to respond.

Some webservers perform reverse DNS lookups on incoming connections.  That's one possible cause for the webserver to be slow in sending a response (and maybe your traces don't start until the webserver has finished this lookup).  But you'll need to drill down more carefully at your side; looking at timestamped packet captures may help.

(*) for TLS, "connect" is from dnsDone to connectDone, and "tls" is from tlsStart to tlsDone.  However, "processing" is still measured from gotConn, and I don't know how that fits in (i.e. before or after TLS).  You may be able to find more from the docs for httptrace.ClientTrace.
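
If you want to see empirically where GotConn fires relative to the TLS handshake, a rough sketch like the one below (placeholder URL; this is an illustration, not the exporter's code) registers the relevant httptrace.ClientTrace callbacks and prints timestamps as a single HTTPS request goes through:

// trace_order.go: print the timing of the httptrace.ClientTrace callbacks for
// one HTTPS request, to see where GotConn falls relative to the TLS handshake.
// The URL is a placeholder.
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptrace"
	"time"
)

func main() {
	start := time.Now()
	stamp := func(event string) {
		fmt.Printf("%-22s +%v\n", event, time.Since(start))
	}

	trace := &httptrace.ClientTrace{
		DNSStart:             func(httptrace.DNSStartInfo) { stamp("DNSStart") },
		DNSDone:              func(httptrace.DNSDoneInfo) { stamp("DNSDone") },
		ConnectStart:         func(network, addr string) { stamp("ConnectStart") },
		ConnectDone:          func(network, addr string, err error) { stamp("ConnectDone") },
		TLSHandshakeStart:    func() { stamp("TLSHandshakeStart") },
		TLSHandshakeDone:     func(tls.ConnectionState, error) { stamp("TLSHandshakeDone") },
		GotConn:              func(httptrace.GotConnInfo) { stamp("GotConn") },
		WroteRequest:         func(httptrace.WroteRequestInfo) { stamp("WroteRequest") },
		GotFirstResponseByte: func() { stamp("GotFirstResponseByte") },
	}

	req, err := http.NewRequest("GET", "https://example.com/", nil) // placeholder URL
	if err != nil {
		panic(err)
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := http.DefaultTransport.RoundTrip(req) // single request, no redirects
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
}

For a brand-new HTTPS connection I'd expect TLSHandshakeDone to be reported before GotConn, which would put the whole of "processing" after the handshake, but running a trace like this against your actual target is the reliable way to confirm.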