Prometheus is failing to scrape a target pod, but able to "wget" it from inside the Prometheus pod


Paolo Filippelli

Feb 19, 2021, 6:41:56 AM
to Prometheus Users
Hi,

I just asked the same question on IRC, but I don't know which is the best place to get support, so I'll also ask here :)

BTW, this is the IRC link: https://matrix.to/#/!HaYTjhTxVqshXFkNfu:matrix.org/$16137341243277ijEwp:matrix.org?via=matrix.org

The Question

I'm seeing a behaviour that I'd very much like to understand; maybe you can help me. We've got a K8s cluster where the Prometheus Operator is installed (v0.35.1). The Prometheus version is v2.11.0.

Istio is also installed in the cluster with the default "PERMISSIVE" mTLS mode, meaning that every Envoy sidecar accepts plain HTTP traffic.
Everything is deployed in the default namespace, and every pod EXCEPT prometheus/alertmanager/grafana is managed by Istio (i.e. the monitoring stack is outside the mesh).
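
To double-check that setup, something like the following should show the mesh mTLS mode and which pods actually carry the istio-proxy sidecar (a rough sketch; the only name taken from this cluster is the default namespace):

# List PeerAuthentication policies; no policy at all means Istio's default (PERMISSIVE) applies
kubectl get peerauthentication --all-namespaces

# Show which pods in "default" have an istio-proxy container injected
kubectl get pods -n default \
  -o custom-columns='NAME:.metadata.name,CONTAINERS:.spec.containers[*].name'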

Prometheus can successfully scrape all its targets (defined via ServiceMonitors), except for 3 or 4 that it fails to scrape.

For example, in the Prometheus logs I can see:

level=debug ts=2021-02-19T11:15:55.595Z caller=scrape.go:927 component="scrape manager" scrape_pool=default/divolte/0 target=http://10.172.22.36:7070/metrics msg="Scrape failed" err="server returned HTTP status 503 Service Unavailable"


But if I log into the Prometheus pod, I can successfully reach the pod that it is failing to scrape:

/prometheus $ wget -SqO /dev/null http://10.172.22.36:7070/metrics
  HTTP/1.1 200 OK
  date: Fri, 19 Feb 2021 11:27:57 GMT
  content-type: text/plain; version=0.0.4; charset=utf-8
  content-length: 75758
  x-envoy-upstream-service-time: 57
  server: istio-envoy
  connection: close
  x-envoy-decorator-operation: divolte-srv.default.svc.cluster.local:7070/*


What am I missing? The scrape configuration is as follows:

- job_name: default/divolte/0
  honor_timestamps: true
  scrape_interval: 30s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - default
  relabel_configs:
  - source_labels: [__meta_kubernetes_service_label_app]
    separator: ;
    regex: divolte
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_service_label_stack]
    separator: ;
    regex: livestreaming
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_port_name]
    separator: ;
    regex: http-metrics
    replacement: $1
    action: keep
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Node;(.*)
    target_label: node
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_endpoint_address_target_kind, __meta_kubernetes_endpoint_address_target_name]
    separator: ;
    regex: Pod;(.*)
    target_label: pod
    replacement: ${1}
    action: replace
  - source_labels: [__meta_kubernetes_namespace]
    separator: ;
    regex: (.*)
    target_label: namespace
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: service
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_pod_name]
    separator: ;
    regex: (.*)
    target_label: pod
    replacement: $1
    action: replace
  - source_labels: [__meta_kubernetes_service_name]
    separator: ;
    regex: (.*)
    target_label: job
    replacement: ${1}
    action: replace
  - separator: ;
    regex: (.*)
    target_label: endpoint
    replacement: http-metrics
    action: replace
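
For context, this job is what the Prometheus Operator renders from a ServiceMonitor named divolte in the default namespace (that is what the default/divolte/0 pool name implies), and the keep rules above expect a Service labelled app=divolte, stack=livestreaming with a port named http-metrics. As a sketch, the source objects can be compared against the generated config with:

kubectl -n default get servicemonitor divolte -o yaml
kubectl -n default get svc -l app=divolte,stack=livestreaming -o yaml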

Thank you!

Stuart Clark

Feb 19, 2021, 4:32:33 PM
to Paolo Filippelli, Prometheus Users

That error message doesn't indicate that there are any problems with getting to the server. It is saying that the server responded with a 503 error code.

Are certain targets consistently failing or do they sometimes work and only sometimes fail?

Are there any access or error logs from the Envoy sidecar or target pod that might shed some light on where that error is coming from?
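
One quick way to answer the first question from Prometheus itself is to look at the up series for those targets over a window, for example via the HTTP API. This is only a sketch: it assumes the API is reachable on localhost:9090, and the label values are guesses based on your relabelling (both the job and service labels end up as the Kubernetes service name):

# 0 over the last hour = consistently failing; something between 0 and 1 = intermittent
curl -sG http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg_over_time(up{namespace="default",service="divolte-srv"}[1h])'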

-- 
Stuart Clark

Paolo Filippelli

Feb 22, 2021, 4:07:49 AM
to Stuart Clark, Prometheus Users
Hi Stuart,

thanks for your answer. I see that the 503 is the actual response from the target, which means that Prometheus is reaching it while scraping but gets this error back.
What I cannot understand is what the differences are, in terms of networking path and components involved, between the scrape (which fails) and the wget from inside the Prometheus pod (which succeeds).
I also don't know how to debug that.

The scraping of the "divolte" service fails consistently, and the same goes for the other 3 or 4 pods.

In "divolte" logs i cannot see anything interesting, nor in main container neither in envoy proxy. Here are the logs:

Main divolte container:
❯ k logs divolte-dpy-594d8cb676-vgd9l prometheus-jmx-exporter
DEBUG: Environment variables set/received...
Service port (metrics): 7070
Destination host: localhost
Destination port: 5555
Rules to appy: divolte
Local JMX: 7071

CONFIG FILE not found, enabling PREPARE_CONFIG feature
Preparing configuration based on environment variables
Configuration preparation completed, final cofiguration dump:
############
---
hostPort: localhost:5555
username:
password:
lowercaseOutputName: true
lowercaseOutputLabelNames: true
########
Starting Service..


Istio-proxy
❯ k logs divolte-dpy-594d8cb676-vgd9l istio-proxy -f
2021-02-22T07:41:15.450702Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T07:41:15.451182Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T07:41:15.894626Z info xdsproxy Envoy ADS stream established
2021-02-22T07:41:15.894837Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:15012
2021-02-22T08:11:25.679886Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T08:11:25.680655Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T08:11:25.936956Z info xdsproxy Envoy ADS stream established
2021-02-22T08:11:25.937120Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:15012
2021-02-22T08:39:56.813543Z info xdsproxy disconnected from XDS server: istiod.istio-system.svc:15012
2021-02-22T08:39:56.814249Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-02-22T08:39:57.183354Z info xdsproxy Envoy ADS stream established
2021-02-22T08:39:57.183653Z info xdsproxy connecting to upstream XDS server: istiod.istio-system.svc:150

Paolo Filippelli

Feb 23, 2021, 6:22:20 AM
to Prometheus Users
I've managed to activate the istio-proxy access logs, and this is what I can see.

This is the log line when I wget from inside the Prometheus container, which succeeds:

[2021-02-23T10:58:55.066Z] "GET /metrics HTTP/1.1" 200 - "-" 0 75771 51 50 "-" "Wget" "4dae0790-1a6a-4750-bc33-4617a6fbaf16" "10.172.22.36:7070" "127.0.0.1:7070" inbound|7070|| 127.0.0.1:42380 10.172.22.36:7070 10.172.23.247:38210 - default

This is the log line when the Prometheus scrape fails:

[2021-02-23T10:58:55.536Z] "GET /metrics HTTP/1.1" 503 UC "-" 0 95 53 - "-" "Prometheus/2.11.0" "2c97c597-6a32-44ed-a2fb-c1d37a2644b3" "10.172.22.36:7070" "127.0.0.1:7070" inbound|7070|| 127.0.0.1:42646 10.172.22.36:7070 10.172.23.247:33758 - default

Any clues? Thank you
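
One difference I can see between the two lines is the client itself: the wget response earlier came back with connection: close, while Prometheus keeps its scrape connections alive between scrapes. As a guess at a reproduction on my side (not a confirmed cause), something like this should exercise a reused connection from the same source, run from a debug container if curl is not available in the Prometheus image:

# curl reuses one TCP connection for both URLs, roughly mimicking a keep-alive scraper
curl -sv -o /dev/null -o /dev/null \
  http://10.172.22.36:7070/metrics http://10.172.22.36:7070/metrics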

Stuart Clark

Feb 23, 2021, 5:21:07 PM
to Paolo Filippelli, Prometheus Users

According to this: https://blog.getambassador.io/understanding-envoy-proxy-and-ambassador-http-access-logs-fee7802a2ec5

The UC response flag indicates "Upstream connection termination", which suggests something on the destination service. I would suggest looking at the logs there - the request ID might be useful for finding the request.
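
As a rough sketch of how to follow that up (the pod and container names are the ones from earlier in the thread; the request ID will only show up if the application logs its requests, and the stats call assumes you can exec into the sidecar):

# Look for the failing request ID in the target pod's main container
kubectl logs divolte-dpy-594d8cb676-vgd9l -c prometheus-jmx-exporter \
  | grep 2c97c597-6a32-44ed-a2fb-c1d37a2644b3

# Check Envoy's upstream connection counters for the inbound 7070 cluster
kubectl exec divolte-dpy-594d8cb676-vgd9l -c istio-proxy -- \
  pilot-agent request GET stats | grep 'inbound|7070' | grep upstream_cx_destroy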


-- 
Stuart Clark