Check consistency of ETCD db and wal files.


Tom K.

Apr 3, 2022, 6:19:20 PM
to etcd-dev
Hello,

Trying to recover a lab RH Kubernetes / OpenShift cluster.  For what it's worth, I could blow this cluster away, but I'm more interested in the tools available to recover such an ETCD cluster from a catastrophic failure, more from a learning standpoint than anything else.  ( There aren't any backups, though I'm well aware those would help.  Even if I had backups, I still wouldn't use them, for the sake of learning how one might recover from this situation given only what's on the hosts themselves. )

[root@rhcpm01 member]# ls -altriR
.:
total 0
287310864 drwx------. 2 root root 246 Apr  3 21:49 snap
125829578 drwxr-xr-x. 3 root root  20 Apr  3 21:49 ..
289407386 drwx------. 2 root root 244 Apr  3 21:49 wal
285213447 drwx------. 4 root root  29 Apr  3 21:49 .

./snap:
total 92516
287310871 -rw-r--r--. 1 root root    13283 Jun 22  2021 000000000000048a-00000000016b29da.snap
287310873 -rw-r--r--. 1 root root    13283 Jun 22  2021 000000000000048a-00000000016cb07b.snap
287310877 -rw-r--r--. 1 root root    13283 Jun 22  2021 0000000000000491-00000000016e371c.snap
287310878 -rw-r--r--. 1 root root    13283 Jun 22  2021 0000000000000491-00000000016fbdbd.snap
287310879 -rw-r--r--. 1 root root    13283 Jun 22  2021 0000000000000494-000000000171445e.snap
287310864 drwx------. 2 root root      246 Apr  3 21:49 .
285213447 drwx------. 4 root root       29 Apr  3 21:49 ..
287310868 -rw-------. 1 root root 94654464 Apr  3 21:49 db

./wal:
total 375024
289407398 -rw-------. 1 root root 64002208 Jun 22  2021 000000000000014c-00000000016c59ec.wal
289407401 -rw-------. 1 root root 64001208 Jun 22  2021 000000000000014d-00000000016d735c.wal
289407387 -rw-------. 1 root root 64010352 Jun 22  2021 000000000000014e-00000000016e8734.wal
289407391 -rw-------. 1 root root 64001744 Jun 22  2021 000000000000014f-00000000016fa15d.wal
289407390 -rw-------. 1 root root 64000000 Jun 22  2021 1.tmp
289407402 -rw-------. 1 root root 64000000 Jun 22  2021 0000000000000150-000000000170bb4c.wal
285213447 drwx------. 4 root root       29 Apr  3 21:49 ..
289407386 drwx------. 2 root root      244 Apr  3 21:49 .
[root@rhcpm01 member]


Background:

Storage under the 3-node-master / 3-node-worker cluster failed while the hosts themselves remained running.  Since ETCD is very much storage dependent, in all likelihood the failure occurred while ETCD was writing to said DB / WAL files, which makes this a very interesting situation.  :)

So any ideas, even destructive ones, are welcome to help me along with this 'scientific experiment'.    :D

The key question I would like to answer:  Can I really recover from this disastrous situation, however difficult it may be?
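
For context, the only route I'm aware of for this is rebuilding a new cluster from one surviving db file via the snapshot restore machinery, skipping the integrity hash check since member/snap/db isn't a proper etcdctl snapshot.  A rough, untested sketch of what I'm assuming that looks like on one node (the peer URLs and the restored data dir are placeholders, this uses 'etcdutl' from etcd 3.5 while older releases use 'etcdctl snapshot restore', and OpenShift wraps all of this in its own restore procedure):

etcdutl snapshot restore /var/lib/etcd/member/snap/db \
  --skip-hash-check=true \
  --name rhcpm01 \
  --initial-cluster rhcpm01=https://rhcpm01:2380 \
  --initial-advertise-peer-urls https://rhcpm01:2380 \
  --data-dir /var/lib/etcd-restored
# Then point etcd at the restored data dir, bring it up as a single member,
# and re-add the other masters with 'etcdctl member add' once it is healthy.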

Some questions about the DB files:

Should they all have the same checksum?
Can I copy one to the other nodes?
Are there any consistency checks or tools I could use on the DB files?  ( See the command sketch below )
Is there a way to isolate the corruption in the file and remedy it?
Is there a way to roll back a transaction on the DB / WAL files?
etc.
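
For what it's worth, these are the checks I'm assuming might apply, though I haven't run them against this data dir yet.  The db under snap/ is a bbolt file, and the etcd source tree ships a WAL dump tool (tools/etcd-dump-logs):

# bbolt CLI: go install go.etcd.io/bbolt/cmd/bbolt@latest
bbolt check /var/lib/etcd/member/snap/db     # walks the B+tree pages and reports inconsistencies
bbolt stats /var/lib/etcd/member/snap/db     # page/bucket statistics, also a quick sanity read of the file
# etcd's own view of the db file (hash, revision, total keys, size):
etcdutl snapshot status /var/lib/etcd/member/snap/db
# Dump the raft entries recorded in the WAL files (expects the data dir containing member/wal, as far as I can tell):
etcd-dump-logs /var/lib/etcd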

Cheers,
etcd-disaster.txt

Tom K.

Apr 3, 2022, 9:27:53 PM
to etcd-dev
How to replicate this issue.

1) Set up a VM with remote storage, or install a VM on remote storage.  ( Be it SAN or NFS )
2) Shut down the SAN or NFS machine.
3) Watch corruption happen in ETCD.  (Virtually 100% guaranteed)

Not sure whether ETCD merely having its data on a remote SAN or NFS mount would cause the same issue.  In this case the VM resides entirely on SAN or NFS, and the corruption happens when the VM keeps running but the storage underneath it disappears, shuts down, or the SAN or NFS host kernel panics.
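
A rough sketch of how I might test that mount-only variant on a plain Linux lab box.  This is untested; the NFS host "nfs01", the export /export/etcd and the /var/lib/etcd data dir are just placeholders:

mount -t nfs nfs01:/export/etcd /var/lib/etcd    # put only the ETCD data dir on remote storage
systemctl start etcd                             # let it take writes for a while
ssh nfs01 systemctl stop nfs-server              # pull the storage out from under it
journalctl -u etcd -f                            # watch for wal/db write errors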

Cheers,