Unrecoverable error: corrupted cluster config file

Dave Vaughan

unread,

Jun 13, 2016, 9:18:23 AM6/13/16

to Redis DB

This morning I came in to find 1 whole shard (2 nodes) had failed on my test Redis cluster. The cluster is made up of 6 nodes, 3 masters & 3 slaves.

This meant that 1/3 of the clustered slots were unavailable. It appears that the nodes crashed and were rebooted at around 7am but Redis failed to start on either with:

Jun 13 07:00:40 testb redis[1215]: Unrecoverable error: corrupted cluster config file.

Jun 13 07:00:40 testb systemd: redis.service: main process exited, code=exited, status=1/FAILURE

Jun 13 07:00:40 testb redis-cli: Could not connect to Redis at 127.0.0.1:6379: Connection refused

Jun 13 07:00:40 testb systemd: redis.service: control process exited, code=exited status=1

Jun 13 07:00:40 testb systemd: Unit redis.service entered failed state.

Jun 13 07:00:40 testb systemd: redis.service failed.

Looking at the /etc/redis/nodes.conf (cluster files) it was indeed corrupt:

5fe341c8812d8871fc57f030a03049a76c3e835f 192.168.10.114:6379 slave,fail 654d6848ff1892ac983234917a643336c2de17ad 1465796156340 1465796154329 53 connected

199a2b70a36d40c8a11a9e48a4c7d07a87f0a540 192.168.10.110:6379 master,fail? - 1465796178468 1465796175956 52 connected 5461-10922

4c5ff5cc4b6b611367af1e1e30ca71d4173e5889 192.168.10.113:6379 myself,slave 199a2b70a36d40c8a11a9e48a4c7d07a87f0a540 0 0 50 connected

92d13c4f7f0276ec7e5ede3e31cacbd8ab3d012a 192.168.10.109:6379 slave 8fc4c824986389815cc8e616c3fd892152a1d371 0 1465796278528 51 connected

654d6848ff1892ac983234917a643336c2de17ad 192.168.10.111:6379 master,fail - 1465796147797 1465796145287 53 connected 10923-16383

8fc4c824986389815cc8e616c3fd892152a1d371 192.168.10.112:6379 master - 1465796277751 14657962

Note the last row is missing data, plus no "vars" row. Looking at, http://download.redis.io/redis-stable/src/cluster.c it's likely due to the last row have less than 8 args.

The only way I managed to resolve was to evict both nodes, re-add them, then re-assign the slots.

To me, it looks like the process crashed when it was writing to the cluster config file - but it happened on both master and slave?? Or was the corrupt cluster config replicated to the slave?

Just wondering if anyone else has experienced this, or maybe know what happened?

Many thanks,

Dave

Message has been deleted

Dave Vaughan

unread,

Jun 13, 2016, 11:08:45 AM6/13/16

to Redis DB

A bit more info: Redis cluster version 3.0.7 on CentOS 7.2 x64

Tuco

unread,

Jun 14, 2016, 12:42:31 AM6/14/16

to Redis DB

Yes, i had got the same error on redis 3.0.7 on redhat, when the machine crashed, i tried deleting the partial lines and it messed up the configuration, then i had to manually add nodes and assign them the slots.

Dave Vaughan

unread,

Jun 28, 2016, 8:31:22 AM6/28/16

to Redis DB

I've created a post about this, should anyone stumble upon it: http://www.dwjvaughan.com/2016/06/23/recovering-from-a-corrupted-cluster-config-file/

Reply all

Reply to author

Forward