Prometheus is failing to scrape a target pod, but able to "wget" it from inside the Prometheus pod


Paolo Filippelli

Feb 19, 2021, 6:41:56 AM
to Prometheus Users
Hi,

I just asked the same question on IRC, but I don't know which is the best place to get support, so I'll also ask here :)

BTW, this is the IRC link: https://matrix.to/#/!HaYTjhTxVqshXFkNfu:matrix.org/$16137341243277ijEwp:matrix.org?via=matrix.org

The Question

I'm seeing a behaviour that I'd very much like to understand; maybe you can help me. We've got a K8s cluster where the Prometheus Operator is installed (v0.35.1). The Prometheus version is v2.11.0.

Istio is also installed in the cluster with the default "PERMISSIVE" mTLS mode, meaning that every Envoy sidecar accepts plain HTTP traffic.
Everything is deployed in the default namespace, and every pod EXCEPT prometheus/alertmanager/grafana is managed by Istio (i.e. the monitoring stack is outside the mesh).
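
To double-check that setup, something like the following should show the mesh mTLS mode and which pods actually carry the istio-proxy sidecar (a rough sketch; the only name taken from this cluster is the default namespace):

# List PeerAuthentication policies; no policy at all means Istio's default (PERMISSIVE) applies
kubectl get peerauthentication --all-namespaces

# Show which pods in "default" have an istio-proxy container injected
kubectl get pods -n default \
  -o custom-columns='NAME:.metadata.name,CONTAINERS:.spec.containers[*].name'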

Prometheus can successfully scrape all its targets (defined via ServiceMonitors), except for 3 or 4 that it fails to scrape.

For example, in the Prometheus logs I can see:

level=debug ts=2021-02-19T11:15:55.595Z caller=scrape.go:927 component="scrape manager" scrape_pool=default/divolte/0 target=http://10.172.22.36:7070/metrics msg="Scrape failed" err="server returned HTTP status 503 Service Unavailable"


But if I log into the Prometheus pod, I can successfully reach the pod that it is failing to scrape:

/prometheus $ wget -SqO /dev/null http://10.172.22.36:7070/metrics
  HTTP/1.1 200 OK
  date: Fri, 19 Feb 2021 11:27:57 GMT
  content-type: text/plain; version=0.0.4; charset=utf-8
  content-length: 75758
  x-envoy-upstream-service-time: 57
  server: istio-envoy
  connection: close
  x-envoy-decorator-operation: divolte-srv.default.svc.cluster.local:7070/*


What am I missing? The scrape configuration is as follows:

- job_name: default/divolte/0
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: divolte
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_stack]
    separator: ;
    regex: livestreaming
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: http-metrics
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: http-metrics
    action: replace
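
For context, this job is what the Prometheus Operator renders from a ServiceMonitor named divolte in the default namespace (that is what the default/divolte/0 pool name implies), and the keep rules above expect a Service labelled app=divolte, stack=livestreaming with a port named http-metrics. As a sketch, the source objects can be compared against the generated config with:

kubectl -n default get servicemonitor divolte -o yaml
kubectl -n default get svc -l app=divolte,stack=livestreaming -o yaml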

Thank you!

Stuart Clark

Feb 19, 2021, 4:32:33 PM
to Paolo Filippelli, Prometheus Users

That error message doesn't indicate that there are any problems with getting to the server. It is saying that the server responded with a 503 error code.

Are certain targets consistently failing or do they sometimes work and only sometimes fail?

Are there any access or error logs from the Envoy sidecar or target pod that might shed some light on where that error is coming from?
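
One quick way to answer the first question from Prometheus itself is to look at the up series for those targets over a window, for example via the HTTP API. This is only a sketch: it assumes the API is reachable on localhost:9090, and the label values are guesses based on your relabelling (both the job and service labels end up as the Kubernetes service name):

# 0 over the last hour = consistently failing; something between 0 and 1 = intermittent
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg_over_time(up{namespace="default",service="divolte-srv"}[1h])'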

-- 
Stuart Clark

Paolo Filippelli

Feb 22, 2021, 4:07:49 AM
to Stuart Clark, Prometheus Users
Hi Stuart,

thanks for your answer. I see that the 503 is the actual response from the target, which means that Prometheus is reaching it while scraping but gets this error back.
What I cannot understand is what the differences are, in terms of networking path and components involved, between the scrape (which fails) and the wget from inside the Prometheus pod (which succeeds).
I also don't know how to debug that.

The scraping of the "divolte" service fails consistently, and the same goes for the other 3 or 4 pods.

In "divolte" logs i cannot see anything interesting, nor in main container neither in envoy proxy. Here are the logs:

Main divolte container:
❯ k logs divolte-dpy-594d8cb676-vgd9l prometheus-jmx-exporter
DEBUG: Environment variables set/received...
Service port (metrics): 7070
Destination host: localhost
Destination port: 5555
Rules to appy: divolte
Local JMX: 7071

CONFIG FILE not found, enabling PREPARE_CONFIG feature
Preparing configuration based on environment variables
Configuration preparation completed, final cofiguration dump:
############
---
hostPort: localhost:5555
username:
password:
lowercaseOutputName: true
lowercaseOutputLabelNames: true
########
Starting Service..


Istio-proxy
❯ k logs divolte-dpy-594d8cb676-vgd9l istio-proxy -f
2021-02-22T07:41:15.450702Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T07:41:15.451182Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T07:41:15.894626Z info xdsproxy Envoy ADS stream established
2021-02-22T07:41:15.894837Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:15012
2021-02-22T08:11:25.679886Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T08:11:25.680655Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T08:11:25.936956Z info xdsproxy Envoy ADS stream established
2021-02-22T08:11:25.937120Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:15012
2021-02-22T08:39:56.813543Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T08:39:56.814249Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T08:39:57.183354Z info xdsproxy Envoy ADS stream established
2021-02-22T08:39:57.183653Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:150

Paolo Filippelli

Feb 23, 2021, 6:22:20 AM
to Prometheus Users
I've managed to activate the istio-proxy access logs, and this is what I can see.

This is the log line when I wget from inside the Prometheus container, which succeeds:

[2021-02-23T10:58:55.066Z] "GET /metrics HTTP/1.1" 200 - "-" 0 75771 51 50 "-" "Wget" "4dae0790-1a6a-4750-bc33-4617a6fbaf16" "10.172.22.36:7070" "127.0.0.1:7070" inbound|7070|| 127.0.0.1:42380 10.172.22.36:7070 10.172.23.247:38210 - default

This is the log line when the Prometheus scrape fails:

[2021-02-23T10:58:55.536Z] "GET /metrics HTTP/1.1" 503 UC "-" 0 95 53 - "-" "Prometheus/2.11.0" "2c97c597-6a32-44ed-a2fb-c1d37a2644b3" "10.172.22.36:7070" "127.0.0.1:7070" inbound|7070|| 127.0.0.1:42646 10.172.22.36:7070 10.172.23.247:33758 - default

Any clues? Thank you
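
One difference I can see between the two lines is the client itself: the wget response earlier came back with connection: close, while Prometheus keeps its scrape connections alive between scrapes. As a guess at a reproduction on my side (not a confirmed cause), something like this should exercise a reused connection from the same source, run from a debug container if curl is not available in the Prometheus image:

# curl reuses one TCP connection for both URLs, roughly mimicking a keep-alive scraper
curl -sv -o /dev/null -o /dev/null \
  http://10.172.22.36:7070/metrics http://10.172.22.36:7070/metrics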

Stuart Clark

Feb 23, 2021, 5:21:07 PM
to Paolo Filippelli, Prometheus Users

According to this: https://blog.getambassador.io/understanding-envoy-proxy-and-ambassador-http-access-logs-fee7802a2ec5

The UC response flag indicates "Upstream connection termination", which suggests something on the destination service. I would suggest looking at the logs there - the request ID might be useful for finding the request.
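
As a rough sketch of how to follow that up (the pod and container names are the ones from earlier in the thread; the request ID will only show up if the application logs its requests, and the stats call assumes you can exec into the sidecar):

# Look for the failing request ID in the target pod's main container
kubectl logs divolte-dpy-594d8cb676-vgd9l -c prometheus-jmx-exporter \
  | grep 2c97c597-6a32-44ed-a2fb-c1d37a2644b3

# Check Envoy's upstream connection counters for the inbound 7070 cluster
kubectl exec divolte-dpy-594d8cb676-vgd9l -c istio-proxy -- \
  pilot-agent request GET stats | grep 'inbound|7070' | grep upstream_cx_destroy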


-- 
Stuart Clark