Hello!
I have a 3-node etcd cluster I use for Kubernetes. It is externally managed, meaning I run it myself rather than letting Kubernetes manage it for me.
A couple of days ago, the cluster went down because of certificate problems. etcd was logging things like:
> Jan 28 04:03:01 etcd3 main.sh[472]: 2024-01-28 04:03:01.664826 I | embed: rejected connection from "192.168.1.155:34240" (error "remote error: tls: bad certificate", ServerName "")
This is the exact first error message that one of the nodes logged for this issue. The timezone is UTC.
And:
> Jan 28 04:03:07 etcd3 main.sh[472]: 2024-01-28 04:03:07.714300 W | rafthttp: health check for peer ff398589599d4f7f could not connect: x509: certificate has expired or is not yet valid (prober "ROUND_TRIPPER_RAFT_MESSAGE")
This is a rather well-known problem, and it's quite clear there is an issue with my etcd-to-etcd (peer) certificates, as well as with the client certificate Kubernetes uses to talk to etcd.
However, here's what I'm confused about: I inspected the certs in use, and all of them have NotAfter dates in the 2030s and NotBefore dates back in 2019, when I generated them.
To me, it smells like there is some kind of hard-coded maximum certificate age inside etcd, and that the limit is 5 years, or rather 5 * 365 days. My certificates were generated on Jan 30th 2019. Five years later, NOT accounting for the leap day in 2020, plus a bit of fuzziness for time zones, means the cluster failed exactly or almost exactly 5 * 365 days after the certs were generated.
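As a rough sanity check on that arithmetic, here's a throwaway Go sketch; the timestamps are just the NotBefore from the cert dump below and the first error from the log above:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// NotBefore of the etcd cert (see the dump below).
	issued := time.Date(2019, time.January, 30, 20, 20, 0, 0, time.UTC)

	// First "bad certificate" log line on this node.
	failure := time.Date(2024, time.January, 28, 4, 3, 1, 0, time.UTC)

	// Five years counted as 5 * 365 days, i.e. ignoring the 2020 leap day.
	fmt.Println("issued + 5*365d:", issued.Add(5*365*24*time.Hour)) // 2024-01-29 20:20:00 +0000 UTC
	fmt.Println("failure - issued:", failure.Sub(issued))           // 43759h43m1s, roughly 1823 days
}
```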
Etcd cert information for one node (the others have the same dates):
> Version: 3 (0x2)
> Signature Algorithm: sha256WithRSAEncryption
> Validity
> Not Before: Jan 30 20:20:00 2019 GMT
> Not After : Jun 28 12:20:00 2030 GMT
Client cert:
> Version: 3 (0x2)
> Signature Algorithm: sha256WithRSAEncryption
> Validity
> Not Before: Jan 30 22:24:00 2019 GMT
> Not After : Jun 28 14:24:00 2030 GMT
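For completeness, one way to double-check which certificate a live endpoint is actually serving (in case it somehow differs from the file on disk) would be a small Go program along these lines. The endpoint and cert/key paths are placeholders, and a peer/client port that enforces mutual TLS will need the client pair to complete the handshake:

```go
package main

import (
	"crypto/tls"
	"fmt"
	"log"
	"os"
)

func main() {
	if len(os.Args) < 2 {
		log.Fatalf("usage: %s host:port [client.crt client.key]", os.Args[0])
	}
	addr := os.Args[1]

	// We only want to see which cert the endpoint serves, so skip chain
	// verification; don't use this config for real client traffic.
	cfg := &tls.Config{InsecureSkipVerify: true}

	// Optionally present a client certificate so endpoints that require
	// mutual TLS still complete the handshake.
	if len(os.Args) == 4 {
		pair, err := tls.LoadX509KeyPair(os.Args[2], os.Args[3])
		if err != nil {
			log.Fatalf("loading client cert: %v", err)
		}
		cfg.Certificates = []tls.Certificate{pair}
	}

	conn, err := tls.Dial("tcp", addr, cfg)
	if err != nil {
		log.Fatalf("TLS handshake with %s failed: %v", addr, err)
	}
	defer conn.Close()

	// Print the validity window of every certificate the server presented.
	for i, cert := range conn.ConnectionState().PeerCertificates {
		fmt.Printf("cert %d: subject=%q\n  NotBefore=%s\n  NotAfter=%s\n",
			i, cert.Subject.CommonName, cert.NotBefore.UTC(), cert.NotAfter.UTC())
	}
}
```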
What went wrong?
Thanks!
Dave