This morning I came in to find that one whole shard (2 nodes) had failed on my test Redis cluster. The cluster is made up of 6 nodes: 3 masters and 3 slaves.
This meant that one third of the cluster's slots were unavailable. It appears the nodes crashed and were rebooted at around 7am, but Redis failed to start on either of them with:
Jun 13 07:00:40 testb redis[1215]: Unrecoverable error: corrupted cluster config file.
Jun 13 07:00:40 testb systemd: redis.service: main process exited, code=exited, status=1/FAILURE
Jun 13 07:00:40 testb redis-cli: Could not connect to Redis at 127.0.0.1:6379: Connection refused
Jun 13 07:00:40 testb systemd: redis.service: control process exited, code=exited status=1
Jun 13 07:00:40 testb systemd: Unit redis.service entered failed state.
Jun 13 07:00:40 testb systemd: redis.service failed.
Looking at /etc/redis/nodes.conf (the cluster config file), it was indeed corrupt:
5fe341c8812d8871fc57f030a03049a76c3e835f 192.168.10.114:6379 slave,fail 654d6848ff1892ac983234917a643336c2de17ad 1465796156340 1465796154329 53 connected
199a2b70a36d40c8a11a9e48a4c7d07a87f0a540 192.168.10.110:6379 master,fail? - 1465796178468 1465796175956 52 connected 5461-10922
4c5ff5cc4b6b611367af1e1e30ca71d4173e5889 192.168.10.113:6379 myself,slave 199a2b70a36d40c8a11a9e48a4c7d07a87f0a540 0 0 50 connected
92d13c4f7f0276ec7e5ede3e31cacbd8ab3d012a 192.168.10.109:6379 slave 8fc4c824986389815cc8e616c3fd892152a1d371 0 1465796278528 51 connected
654d6848ff1892ac983234917a643336c2de17ad 192.168.10.111:6379 master,fail - 1465796147797 1465796145287 53 connected 10923-16383
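In case it helps anyone reproduce or diagnose this, here's a quick Python sketch I put together (my own helper, not an official Redis tool) that sanity-checks a nodes.conf by verifying each node line has the fields Redis expects: node ID, ip:port, flags, master ID (or -), ping-sent, pong-received, config epoch, link state, then optional slot ranges:

```python
import re

# A Redis cluster node ID is 40 lowercase hex characters.
NODE_ID = re.compile(r"^[0-9a-f]{40}$")

def check_nodes_conf(text):
    """Return a list of (line_number, reason) for lines that look corrupt."""
    problems = []
    for n, line in enumerate(text.splitlines(), 1):
        # Blank lines and the trailing "vars currentEpoch ..." line are valid.
        if not line.strip() or line.startswith("vars"):
            continue
        fields = line.split()
        if len(fields) < 8:
            problems.append((n, "too few fields"))
            continue
        if not NODE_ID.match(fields[0]):
            problems.append((n, "bad node id"))
        if fields[7] not in ("connected", "disconnected"):
            problems.append((n, "bad link state"))
    return problems
```

Running it over my corrupt file flags the broken lines, while a healthy entry like the 192.168.10.111 one above passes cleanly.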
The only way I managed to resolve it was to evict both nodes, re-add them, and then re-assign the slots.
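For the record, those recovery steps can be scripted. A rough Python sketch (my own helper; the hosts and node IDs below are just placeholders) that builds the redis-cli commands using CLUSTER FORGET, CLUSTER MEET and CLUSTER ADDSLOTS:

```python
def recovery_commands(survivor, failed_ids, new_nodes, slots):
    """Build redis-cli commands to evict failed nodes, re-add fresh ones,
    and assign the orphaned slot range to the first new node.

    survivor:   (host, port) of a healthy node to issue commands against.
                In practice CLUSTER FORGET must be sent to *every*
                surviving node, not just one.
    failed_ids: node IDs of the dead nodes.
    new_nodes:  (host, port) pairs of the replacement nodes.
    slots:      (start, end) inclusive slot range to re-assign.
    """
    host, port = survivor
    cli = f"redis-cli -h {host} -p {port}"
    cmds = [f"{cli} cluster forget {nid}" for nid in failed_ids]
    cmds += [f"{cli} cluster meet {h} {p}" for h, p in new_nodes]
    start, end = slots
    new_host, new_port = new_nodes[0]
    slot_args = " ".join(str(s) for s in range(start, end + 1))
    cmds.append(f"redis-cli -h {new_host} -p {new_port} cluster addslots {slot_args}")
    return cmds
```

After the slots are assigned, the second replacement node can be made a replica with CLUSTER REPLICATE; I did that step by hand.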
To me, it looks like the process crashed while it was writing to the cluster config file - but it happened on both the master and the slave?? Or was the corrupt cluster config replicated to the slave?
Just wondering if anyone else has experienced this, or maybe know what happened?
Many thanks,
Dave