Hi,
I just asked the same question on IRC, but I don't know which is the best place to get support, so I'll also ask here :)
BTW, this is the IRC link: https://matrix.to/#/!HaYTjhTxVqshXFkNfu:matrix.org/$16137341243277ijEwp:matrix.org?via=matrix.org
The Question
I'm seeing a behaviour that I'd very much like to understand; maybe you can help me. We've got a K8s cluster where the Prometheus Operator is installed (v0.35.1). The Prometheus version is v2.11.0.
Istio has also been installed in the cluster with the default "PERMISSIVE" mode, meaning that every Envoy sidecar accepts plain HTTP traffic. Everything is deployed in the default namespace, and every pod EXCEPT prometheus/alertmanager/grafana is managed by Istio (i.e. the monitoring stack is out of the mesh).
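To be explicit about what "PERMISSIVE" means here: on Istio 1.5+ the default behaviour corresponds roughly to the mesh-wide PeerAuthentication policy below. We haven't applied it ourselves, so the resource name and namespace are just the conventional ones, not something taken from our cluster.

  apiVersion: security.istio.io/v1beta1
  kind: PeerAuthentication
  metadata:
    name: default
    namespace: istio-system
  spec:
    mtls:
      mode: PERMISSIVE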
Prometheus can successfully scrape almost all of its targets (defined via ServiceMonitors); only 3 or 4 of them fail.
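For reference, the ServiceMonitor for one of the failing targets (divolte) looks roughly like this — I'm writing it from memory, so the selector label and the port name are approximate:

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: divolte
    namespace: default
  spec:
    selector:
      matchLabels:
        app: divolte        # approximate label
    endpoints:
    - port: metrics         # service port name; maps to container port 7070
      path: /metrics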
For example, in the Prometheus logs I can see:
level=debug ts=2021-02-19T11:15:55.595Z caller=scrape.go:927 component="scrape manager" scrape_pool=default/divolte/0 target=http://10.172.22.36:7070/metrics msg="Scrape failed" err="server returned HTTP status 503 Service Unavailable"
But if I log into the Prometheus pod, I can successfully reach the pod that it fails to scrape:
/prometheus $ wget -SqO /dev/null http://10.172.22.36:7070/metrics
  HTTP/1.1 200 OK
  date: Fri, 19 Feb 2021 11:27:57 GMT
  content-type: text/plain; version=0.0.4; charset=utf-8
  content-length: 75758
  x-envoy-upstream-service-time: 57
  server: istio-envoy
  connection: close
  x-envoy-decorator-operation: divolte-srv.default.svc.cluster.local:7070/*
That error message doesn't indicate that there are any problems with getting to the server. It is saying that the server responded with a 503 error code.
Are certain targets consistently failing, or do they sometimes work and sometimes fail?
Are there any access or error logs from the Envoy sidecar or target pod that might shed some light on where that error is coming from?
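For example, something along these lines, assuming access logging is enabled for the sidecar (the pod name is a placeholder):

  kubectl -n default logs <target-pod> -c istio-proxy

Depending on the install profile, Envoy access logging may need to be switched on first.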
-- Stuart Clark
I've managed to activate the istio-proxy access logs, and below is what I can see.
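(For the record, I enabled access logging more or less like this — quoting from memory, so the exact flag may differ slightly:)

  istioctl install --set meshConfig.accessLogFile=/dev/stdout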
This is the log line when I run wget from inside the Prometheus container (successful):
[2021-02-23T10:58:55.066Z] "GET /metrics HTTP/1.1" 200 - "-" 0 75771 51 50 "-" "Wget" "4dae0790-1a6a-4750-bc33-4617a6fbaf16" "10.172.22.36:7070" "127.0.0.1:7070" inbound|7070|| 127.0.0.1:42380 10.172.22.36:7070 10.172.23.247:38210 - default
This is the log line when the Prometheus scrape fails:
[2021-02-23T10:58:55.536Z] "GET /metrics HTTP/1.1" 503 UC "-" 0 95 53 - "-" "Prometheus/2.11.0" "2c97c597-6a32-44ed-a2fb-c1d37a2644b3" "10.172.22.36:7070" "127.0.0.1:7070" inbound|7070|| 127.0.0.1:42646 10.172.22.36:7070 10.172.23.247:33758 - default
According to this:
https://blog.getambassador.io/understanding-envoy-proxy-and-ambassador-http-access-logs-fee7802a2ec5
The UC response flag indicates "upstream connection termination", which suggests a problem on the destination service side. I would suggest looking at the logs there - the request ID might be useful to try to find the request?
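For example, something like this (the pod and container names are placeholders, and the second command assumes the application itself logs request IDs):

  kubectl -n default logs <divolte-pod> -c istio-proxy | grep 2c97c597-6a32-44ed-a2fb-c1d37a2644b3
  kubectl -n default logs <divolte-pod> -c <app-container> | grep 2c97c597-6a32-44ed-a2fb-c1d37a2644b3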
-- Stuart Clark