Istio mTLS scraping fails with "connection reset"

Travis Illig

unread,

Jul 6, 2021, 2:02:56 PM7/6/21

to Prometheus Users

I'm deploying Prometheus using the Helm chart and I have it configured to scrape Istio mTLS-secured pods using the TLS settings specified by the Istio team to do so. Basically what this amounts to is:

Add the Istio sidecar to the Prometheus instance but disable all traffic proxying - you just want to get the certificates from it.
Mount the certificates into the Prometheus container.
Set up your scrape configuration to use the certificates when scraping Istio-enabled pods.

The YAML for the scrape configuration looks like this:

- job_name: "kubernetes-pods-istio-secure"

scheme: https

tls_config:

ca_file: /etc/istio-certs/root-cert.pem

cert_file: /etc/istio-certs/cert-chain.pem

key_file: /etc/istio-certs/key.pem

insecure_skip_verify: true

This totally works using Prometheus v2.20.1 packaged as `prom/prometheus` from Docker Hub.

This fails on Prometheus v2.28.0 packaged as `quay.io/prometheus/prometheus`. Instead of getting a successful scrape, I get "connection reset by peer." I've validated the files are there and properly mounted; they have the expected contents; and there are no Prometheus log messages to indicate anything is amiss.

I've been rolling back slowly to see where it starts working again. I've tried v2.26.0 and it still fails. I thought I'd drop a note in here to see if anyone knows what's up.

Travis Illig

unread,

Jul 6, 2021, 3:01:08 PM7/6/21

to Prometheus Users

I've verified:

v2.20.1 is the last version where the mTLS scraping works.
It doesn't matter which Docker registry you pull from (Docker Hub or quay.io - I've sometimes seen different "versions" of containers based on registry).

Looking at the release notes for v2.21.0 it appears there's a new version of Go used for compilation which includes some changes on how certificates are handled. Unclear if this is what I'm hitting, but it seems worth looking into.

Travis Illig

unread,

Jul 6, 2021, 4:24:13 PM7/6/21

to Prometheus Users

It's not the certificate handling. I tried setting GODEBUG as indicated in the docs and that didn't fix anything. I'm starting to wonder if it's an HTTP/2 issue or something similar but I'm not sure how to determine if that's the problem.

The error message in Prometheus debug logs isn't super helpful, it just seems to indicate a protocol problem.

level=debug ts=2021-07-06T20:00:50.996Z caller=scrape.go:1091 component="scrape manager" scrape_pool=kubernetes-pods-istio-secure target=https://10.244.3.10:9102/metrics msg="Scrape failed" err="Get \"https://10.244.3.10:9102/metrics\": read tcp 10.244.4.85:51794->10.244.3.10:9102: read: connection reset by peer"

Travis Illig

unread,

Jul 7, 2021, 5:07:02 PM7/7/21

to Prometheus Users

I can create an Ubuntu container and verify connectivity to the container metrics endpoint with both curl and openssl:

curl https://10.244.3.10:9102/metrics --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure

openssl s_client -connect 10.244.3.10:9102 -cert /etc/istio-certs/cert-chain.pem -key /etc/istio-certs/key.pem -CAfile /etc/istio-certs/root-cert.pem -alpn "istio"

The curl call seems to correctly auto-negotiate the TLS 1.3 comms. The openssl call requires the -alpn "istio" flag to negotiate the protocol at the application layer or it will fail to connect.

The results of my testing (shown below) make me think it's something in Prometheus or the Go stack causing the problem. I don't think it's an OS configuration issue in the container or anything like that. However, I'm not sure how to debug the Prometheus/Go side of things.

A more verbose log from curl shows it will default to HTTP/2 (which I recall seeing is disabled in Prometheus at the moment).

root@sleep-5f98748557-s4wh5:/# curl https://10.244.3.10:9102/metrics --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure -v

* Trying 10.244.3.10:9102...

* TCP_NODELAY set

* Connected to 10.244.3.10 (10.244.3.10) port 9102 (#0)

* ALPN, offering h2

* ALPN, offering http/1.1

* successfully set certificate verify locations:

* CAfile: /etc/istio-certs/root-cert.pem

CApath: /etc/ssl/certs

* TLSv1.3 (OUT), TLS handshake, Client hello (1):

* TLSv1.3 (IN), TLS handshake, Server hello (2):

* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):

* TLSv1.3 (IN), TLS handshake, Request CERT (13):

* TLSv1.3 (IN), TLS handshake, Certificate (11):

* TLSv1.3 (IN), TLS handshake, CERT verify (15):

* TLSv1.3 (IN), TLS handshake, Finished (20):

* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):

* TLSv1.3 (OUT), TLS handshake, Certificate (11):

* TLSv1.3 (OUT), TLS handshake, CERT verify (15):

* TLSv1.3 (OUT), TLS handshake, Finished (20):

* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384

* ALPN, server accepted to use h2

* Server certificate:

* subject: [NONE]

* start date: Jul 7 20:21:33 2021 GMT

* expire date: Jul 8 20:21:33 2021 GMT

* issuer: O=cluster.local

* SSL certificate verify ok.

* Using HTTP2, server supports multi-use

* Connection state changed (HTTP/2 confirmed)

* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0

* Using Stream ID: 1 (easy handle 0x564d80d81e10)

> GET /metrics HTTP/2

> Host: 10.244.3.10:9102

> user-agent: curl/7.68.0

> accept: */*

>

* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):

* old SSL session ID is stale, removing

* Connection state changed (MAX_CONCURRENT_STREAMS == 2147483647)!

< HTTP/2 200

I can add --http1.1 to force HTTP/1.1 and it'll still work:

root@sleep-5f98748557-s4wh5:/# curl https://10.244.3.10:9102/metrics --cacert /etc/istio-certs/root-cert.pem --cert /etc/istio-certs/cert-chain.pem --key /etc/istio-certs/key.pem --insecure -v --http1.1

* Trying 10.244.3.10:9102...

* TCP_NODELAY set

* Connected to 10.244.3.10 (10.244.3.10) port 9102 (#0)

* ALPN, offering http/1.1

* successfully set certificate verify locations:

* CAfile: /etc/istio-certs/root-cert.pem

CApath: /etc/ssl/certs

* TLSv1.3 (OUT), TLS handshake, Client hello (1):

* TLSv1.3 (IN), TLS handshake, Server hello (2):

* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):

* TLSv1.3 (IN), TLS handshake, Request CERT (13):

* TLSv1.3 (IN), TLS handshake, Certificate (11):

* TLSv1.3 (IN), TLS handshake, CERT verify (15):

* TLSv1.3 (IN), TLS handshake, Finished (20):

* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):

* TLSv1.3 (OUT), TLS handshake, Certificate (11):

* TLSv1.3 (OUT), TLS handshake, CERT verify (15):

* TLSv1.3 (OUT), TLS handshake, Finished (20):

* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384

* ALPN, server accepted to use http/1.1

* Server certificate:

* subject: [NONE]

* start date: Jul 7 20:21:33 2021 GMT

* expire date: Jul 8 20:21:33 2021 GMT

* issuer: O=cluster.local

* SSL certificate verify ok.

> GET /metrics HTTP/1.1

> Host: 10.244.3.10:9102

> User-Agent: curl/7.68.0

> Accept: */*

>

* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):

* old SSL session ID is stale, removing

* Mark bundle as not supporting multiuse

< HTTP/1.1 200 OK

Since that works it makes me wonder if there's something wrong with the ALPN handling in the way HTTP/2 is disabled at the moment, like maybe it's not negotiating right? I have no idea, I'm mostly grasping at straws.

Travis Illig

unread,

Jul 12, 2021, 12:07:50 PM7/12/21

to Prometheus Users

Just to close the loop here, the issue ended up being that HTTP/2 is disabled.

https://github.com/prometheus/prometheus/issues/9068

Reply all

Reply to author

Forward