Following the recent reproduction of data inconsistency issues in #13766, the etcd maintainers no longer recommend v3.5 releases for production. In our testing, we found that if the etcd process is killed under high load, some committed transactions are occasionally not reflected on all members. The problem affects versions v3.5.0, v3.5.1, and v3.5.2.
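For operators who want to script an inventory of their clusters, the affected range above can be expressed as a small check. This is only a sketch: `isAffected` is a hypothetical helper, not part of etcd, and it assumes you already have a plain version string such as the one printed by `etcd --version`. Anything that is not a plain MAJOR.MINOR.PATCH string (for example a pre-release tag) is returned as an error rather than guessed at.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isAffected reports whether a version string such as "v3.5.1" or "3.5.2"
// is one of the releases named in this announcement (v3.5.0, v3.5.1, v3.5.2).
// Hypothetical helper for illustration; not part of etcd.
func isAffected(version string) (bool, error) {
	v := strings.TrimPrefix(strings.TrimSpace(version), "v")
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return false, fmt.Errorf("expected MAJOR.MINOR.PATCH, got %q", version)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return false, fmt.Errorf("invalid version component %q in %q", p, version)
		}
		nums[i] = n
	}
	// Affected: major 3, minor 5, patch 0 through 2.
	return nums[0] == 3 && nums[1] == 5 && nums[2] <= 2, nil
}

func main() {
	for _, v := range []string{"v3.4.18", "v3.5.0", "v3.5.2", "v3.5.3"} {
		affected, err := isAffected(v)
		if err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		fmt.Printf("%s affected: %t\n", v, affected)
	}
}
```

A check like this only identifies which members run an affected release; it does not detect whether an inconsistency has already occurred.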
Recommendations if you are running v3.4.X:
Don't upgrade your etcd clusters to v3.5 until the problem is fixed in the upcoming v3.5.3 release.
There are no breaking API changes, so it is safe to let v3.5 clients (e.g. the latest Kubernetes releases) talk to v3.4 servers.
Recommendations if you are running v3.5.0, v3.5.1, or v3.5.2:
Enable the data corruption check with the `--experimental-initial-corrupt-check` flag. The flag is the only reliable automated way of detecting an inconsistency. This mode has seen significant usage in production and is going to be promoted to the default in etcd v3.6.
Ensure the etcd cluster is not under memory pressure and that etcd processes are not interrupted by SIGKILL. Either can disrupt a process in the middle of business logic and trigger the issue.
Avoid etcd downgrades: they are not officially supported, and clusters can be safely recovered as long as the data corruption check is enabled.
If you have encountered data corruption, please follow the instructions at https://etcd.io/docs/v3.5/op-guide/data_corruption/.
Thanks,
etcd-maintainers