Unexpected certificate-related failure in 3-node etcd cluster


Dave P

Feb 8, 2024, 4:17:46 AM
to etcd-dev

I have a 3-node etcd cluster that I use for Kubernetes. It is externally managed, meaning I manage it myself instead of letting Kubernetes manage it for me.

A couple of days ago, the cluster went down because of certificate problems. etcd was logging messages like:

> Jan 28 04:03:01 etcd3 main.sh[472]: 2024-01-28 04:03:01.664826 I | embed: rejected connection from "" (error "remote error: tls: bad certificate", ServerName "")

This is the very first error message for this issue that one of the nodes logged. Timestamps are in UTC.


> Jan 28 04:03:07 etcd3 main.sh[472]: 2024-01-28 04:03:07.714300 W | rafthttp: health check for peer ff398589599d4f7f could not connect: x509: certificate has expired or is not yet valid (prober "ROUND_TRIPPER_RAFT_MESSAGE")

This is a rather well-known problem. It is quite clear that there is a problem with my etcd-to-etcd certificates, as well as with the client certificate Kubernetes uses to talk to etcd.

However, here's what I'm confused about: I inspected the certs in use, and all of them have NotAfter dates in the 2030s, and NotBefore back in 2019 when I generated them.

To me, it smells like there is some kind of hard-coded maximum certificate age inside etcd, and that the limit is 5 years. Or rather, 5 * 365 days. My certificates were generated on Jan 30th 2019. 5 years later, NOT accounting for the leap year in 2020, and allowing a bit of fuzziness for time zones, means the cluster failed exactly or almost exactly 365*5 days after cert generation.
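The arithmetic behind that guess can be checked with a minimal Python sketch. It uses only the two timestamps quoted in this post (the peer certificate's Not Before and the first logged error, both UTC). The 5 * 365 day limit is the hypothesis being tested, not a documented etcd setting, and the variable names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# Values copied from this post (all UTC).
not_before = datetime(2019, 1, 30, 20, 20, 0, tzinfo=timezone.utc)  # peer cert Not Before
first_error = datetime(2024, 1, 28, 4, 3, 1, tzinfo=timezone.utc)   # first logged TLS error

# Hypothetical limit under test: 5 * 365 days (an assumption, not a known etcd constant).
suspected_limit = timedelta(days=5 * 365)

limit_reached_at = not_before + suspected_limit
age_at_failure = first_error - not_before
gap = limit_reached_at - first_error

print("5*365 days after Not Before:", limit_reached_at.isoformat())
print("certificate age at first error:", age_at_failure)
print("gap between the two:", gap)
```

With these inputs the 5 * 365 day mark falls on Jan 29 2024 at 20:20 UTC, roughly a day and sixteen hours after the first logged error, so the match is close but not exact.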

Etcd cert information for one node, others have the same dates:

> Version: 3 (0x2)
> Signature Algorithm: sha256WithRSAEncryption
> Validity
>     Not Before: Jan 30 20:20:00 2019 GMT
>     Not After : Jun 28 12:20:00 2030 GMT

Client cert:

> Version: 3 (0x2)
> Signature Algorithm: sha256WithRSAEncryption
> Validity
>     Not Before: Jan 30 22:24:00 2019 GMT
>     Not After : Jun 28 14:24:00 2030 GMT
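To double-check my reading of those dates, here is a small Python sketch that parses the validity strings above and tests whether the time of the first error falls inside each window. The helper names are made up for illustration, the date format assumes English month abbreviations, and it only covers the two certificates dumped above.

```python
from datetime import datetime, timezone

# Date layout used in the certificate dumps above, e.g. "Jan 30 20:20:00 2019 GMT".
CERT_TIME_FORMAT = "%b %d %H:%M:%S %Y GMT"


def parse_cert_time(text):
    """Parse a Not Before / Not After string and mark it as UTC."""
    return datetime.strptime(text, CERT_TIME_FORMAT).replace(tzinfo=timezone.utc)


def in_validity_window(moment, not_before, not_after):
    """Return True if `moment` lies between the two validity bounds (inclusive)."""
    return parse_cert_time(not_before) <= moment <= parse_cert_time(not_after)


# Time of the first logged TLS error (UTC), taken from the log line above.
first_error = datetime(2024, 1, 28, 4, 3, 1, tzinfo=timezone.utc)

peer_ok = in_validity_window(
    first_error, "Jan 30 20:20:00 2019 GMT", "Jun 28 12:20:00 2030 GMT"
)
client_ok = in_validity_window(
    first_error, "Jan 30 22:24:00 2019 GMT", "Jun 28 14:24:00 2030 GMT"
)

print("peer cert valid at first error:", peer_ok)
print("client cert valid at first error:", client_ok)
```

Both checks come out true for the dates shown, which is why the expiry error is confusing.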

What went wrong?


Dave P

Feb 8, 2024, 1:11:57 PM
to etcd-dev
I realize I should have included the following details:

- I'm running etcd 3.3.25 from a Docker image
- My command line looks like:

> /usr/local/bin/etcd
>     --data-dir=/etcd-data
>     --name node1
>     --peer-cert-file=/etcd-data/cert.pem
>     --peer-key-file=/etcd-data/key.pem
>     --peer-trusted-ca-file=/etcd-data/ca.pem
>     --peer-client-cert-auth
>     --cert-file=/etcd-data/client-cert.pem
>     --key-file=/etcd-data/client-key.pem
>     --trusted-ca-file=/etcd-data/client-ca.pem
>     --advertise-client-urls
>     --listen-peer-urls
>     --listen-client-urls


Marek Siarkowicz

Feb 8, 2024, 1:24:02 PM
to etcd-dev
A note, to make sure there is no misunderstanding:
etcd v3.3 is a 6-year-old release that went out of support 1.5 years ago. https://github.com/etcd-io/website/pull/601

The main challenge with helping you is that most etcd community members are no longer using this version.

Please consider upgrading.