Etcd v3.5.[0-2] is not recommended for production


Marek Siarkowicz

Mar 29, 2022, 1:05:09 PM
to etcd-dev, d...@kubernetes.io

With the recent reproduction of data inconsistency issues in #13766, etcd maintainers no longer recommend v3.5 releases for production. In our testing we have found that if the etcd process is killed under high load, occasionally some committed transactions are not reflected on all the members. The problem affects versions v3.5.0, v3.5.1, and v3.5.2.
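
For illustration, one manual way to spot-check for this kind of divergence is to compare the KV store hash reported by each member; this is a sketch assuming etcdctl v3.5 and example endpoint addresses:

```
# Ask every member for a hash of its KV store; differing hashes at the
# same revision indicate an inconsistency. Endpoints below are examples.
etcdctl --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
  endpoint hashkv -w table
```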


Recommendations if you are running v3.4.X:

  • Don't upgrade your etcd clusters to v3.5 until the problem is fixed in the upcoming v3.5.3 release.

  • There are no breaking changes in the API, so it’s safe to let v3.5 clients (e.g. the latest Kubernetes releases) talk to v3.4 servers.


Recommendations if you are running v3.5.0, v3.5.1, or v3.5.2:

  • Enable the data corruption check with the `--experimental-initial-corrupt-check` flag; see the sketch after this list. The flag is currently the only reliable automated way of detecting an inconsistency. The check has seen significant usage in production and is going to be promoted to the default in etcd v3.6.

  • Ensure the etcd cluster is not under memory pressure or interrupted by SIGKILL (e.g. by the OOM killer), which can stop the process in the middle of applying a transaction and trigger the issue.

  • Avoid etcd downgrades, as they are not officially supported; clusters can be safely recovered as long as the data corruption check is enabled.
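
As a sketch of the corruption check recommendation above (the flag name is from this announcement; the data directory and the periodic-check flag are illustrative of a typical v3.5 setup):

```
# Start etcd with the corruption check enabled. On startup the member
# compares its KV hash with its peers and refuses to serve if they
# differ, instead of silently returning inconsistent data.
etcd --data-dir /var/lib/etcd \
  --experimental-initial-corrupt-check=true \
  --experimental-corrupt-check-time=5m   # optional periodic re-check
```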


If you have encountered data corruption, please follow instructions on https://etcd.io/docs/v3.5/op-guide/data_corruption/.
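
For orientation, the recovery described in those docs amounts to replacing the inconsistent member rather than downgrading; a rough sketch, where the member name, ID, URLs, and paths are illustrative only:

```
# 1. Identify the inconsistent member (the corruption check logs the mismatch).
etcdctl member list -w table

# 2. Remove it from the cluster (member ID is an example).
etcdctl member remove 8e9e05c52164694d

# 3. On the affected node, wipe its data directory (path is an example).
rm -rf /var/lib/etcd

# 4. Re-add the member and start etcd as part of the existing cluster so it
#    replicates a fresh copy of the data (other cluster flags omitted).
etcdctl member add infra1 --peer-urls=https://10.0.0.1:2380
etcd --name infra1 --data-dir /var/lib/etcd \
  --initial-cluster-state existing
```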

Thanks,

etcd-maintainers


Josh Berkus

Mar 29, 2022, 6:55:57 PM
to Marek Siarkowicz, etcd...@googlegroups.com
Marek,

We should also post this to the Etcd blog. Do you need any help with that?


--
-- Josh Berkus
Kubernetes Community Architect
OSPO, OCTO

Marek Siarkowicz

Apr 24, 2022, 1:51:00 PM
to etcd-dev
With the release of v3.5.3 and v3.5.4, the data inconsistency issue has been addressed. However, this doesn't mean our work is done. We need to do everything we can to prevent such issues in the future. As a first step, I have started writing a postmortem that collects the lessons learned from the incident: things that went well or badly, places where we got lucky or need improvement. From there we can agree on a list of actions to take to make lasting improvements.

By having a public discussion of the incident, I hope we can make the process more thorough. Feedback is welcome!
I would like to invite everyone to read it and share it with anyone who might be interested.
Link to the postmortem: https://github.com/etcd-io/etcd/blob/main/Documentation/postmortems/v3.5-data-inconsistency.md

I hope we can collaborate to make etcd even more reliable!
Thanks,
@serathius