I'm trying to get Redis set up in a two-machine HA configuration, using redis-sentinel to control failover. Redis is version 3.2.3, running on CentOS 6.8. The initial setup goes smoothly, and I can get the two machines set up as master/slave, with an instance of redis-sentinel running on each one. When I test failover by shutting down redis-server on the master machine, failover works properly, as does failing back once I restart redis-server (which properly comes up as a slave) and shut down the new master. So that part is good.
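For reference, sentinel.conf on each machine is essentially the stock file plus the monitor line. The monitor/quorum values below match what the logs show; the rest is my best recollection and may not be exact:

port 26379
# quorum of 1, matching the +monitor line in the logs below
sentinel monitor mymaster 10.211.55.100 6379 1
# stock default, as far as I recall
sentinel down-after-milliseconds mymaster 30000
# 60 seconds would at least explain the one-minute "Next failover delay" retries below
sentinel failover-timeout mymaster 60000
sentinel parallel-syncs mymaster 1

The slave's redis.conf additionally has slaveof 10.211.55.100 6379, so that 101 starts out as a slave of 100.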
However, once I move to more "real world" testing by shutting down the master machine completely, redis quickly gets confused and stops working - I end up in a situation where both machines are configured as slaves, and there is no master at all. How long this takes varies: I've seen it happen immediately, but I've also seen it survive one or two failovers. I have never seen it work reliably.
Detail of testing: Two machines, at 10.211.55.100 and 10.211.55.101, with (initially) 100 as master and 101 as slave. Looking at the redis-sentinel log on 101 (the slave machine), I see normal operation:
2785:X 07 Oct 10:59:38.425 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
2785:X 07 Oct 10:59:38.425 # Sentinel ID is 49843b0d832503d8f31fc80f1439479ae8f26934
2785:X 07 Oct 10:59:38.425 # +monitor master mymaster 10.211.55.100 6379 quorum 1
Then I shut down the current master machine (100), and the sentinel log on 101 shows this:
2785:X 07 Oct 11:00:08.563 * +slave slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.018 # +sdown master mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.019 # +odown master mymaster 10.211.55.100 6379 #quorum 1/1
2785:X 07 Oct 11:01:39.019 # +new-epoch 1
2785:X 07 Oct 11:01:39.019 # +try-failover master mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.022 # +vote-for-leader 49843b0d832503d8f31fc80f1439479ae8f26934 1
2785:X 07 Oct 11:01:39.022 # +elected-leader master mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.022 # +failover-state-select-slave master mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.074 # +selected-slave slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.075 * +failover-state-send-slaveof-noone slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.146 * +failover-state-wait-promotion slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.999 # +promoted-slave slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:39.999 # +failover-state-reconf-slaves master mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:40.072 # +failover-end master mymaster 10.211.55.100 6379
2785:X 07 Oct 11:01:40.072 # +switch-master mymaster 10.211.55.100 6379 10.211.55.101 6379
2785:X 07 Oct 11:01:40.072 * +slave slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
2785:X 07 Oct 11:01:41.098 # +sdown slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
...which appears to me to be a good failover to the slave. So far, so good. Then I bring the old "master" back up:
2785:X 07 Oct 11:08:18.130 # -sdown slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
2785:X 07 Oct 11:08:28.077 * +convert-to-slave slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
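To double-check the state after this fail-over/fail-back cycle, asking either sentinel for the current master should report 101. Something along these lines (hypothetical check, assuming the default sentinel port of 26379 - I didn't capture the actual output at the time):

$ redis-cli -h 10.211.55.100 -p 26379 sentinel get-master-addr-by-name mymaster
1) "10.211.55.101"
2) "6379"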
So that also works, apparently converting 100 back to a slave. Ok, let's try again. This time, since 101 is now the master, we'll shut down 101 and watch the logs on 100:
1757:X 07 Oct 11:15:38.030 # +sdown master mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:38.030 # +odown master mymaster 10.211.55.101 6379 #quorum 1/1
1757:X 07 Oct 11:15:38.030 # +new-epoch 3
1757:X 07 Oct 11:15:38.030 # +try-failover master mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:38.034 # +vote-for-leader 49843b0d832503d8f31fc80f1439479ae8f26934 3
1757:X 07 Oct 11:15:38.034 # +elected-leader master mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:38.034 # +failover-state-select-slave master mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:38.087 # +selected-slave slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:38.087 * +failover-state-send-slaveof-noone slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:38.153 * +failover-state-wait-promotion slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:38.984 # +promoted-slave slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:38.984 # +failover-state-reconf-slaves master mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:39.032 # +failover-end master mymaster 10.211.55.101 6379
1757:X 07 Oct 11:15:39.032 # +switch-master mymaster 10.211.55.101 6379 10.211.55.100 6379
1757:X 07 Oct 11:15:39.032 * +slave slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
1757:X 07 Oct 11:15:40.081 # +sdown slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
Again, this looks good - it failed back to 100 as the new master, at least apparently. But look what happens when we bring 101 back up:
1757:X 07 Oct 11:18:46.495 # -sdown slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
1757:X 07 Oct 11:18:56.402 * +convert-to-slave slave 10.211.55.101:6379 10.211.55.101 6379 @ mymaster 10.211.55.100 6379
1757:X 07 Oct 11:19:17.579 # +sdown master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:19:17.579 # +odown master mymaster 10.211.55.100 6379 #quorum 1/1
1757:X 07 Oct 11:19:17.579 # +new-epoch 4
1757:X 07 Oct 11:19:17.580 # +try-failover master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:19:17.635 # +vote-for-leader 49843b0d832503d8f31fc80f1439479ae8f26934 4
1757:X 07 Oct 11:19:17.635 # +elected-leader master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:19:17.635 # +failover-state-select-slave master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:19:17.725 # -failover-abort-no-good-slave master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:19:17.826 # Next failover delay: I will not start a failover before Fri Oct 7 11:20:18 2016
1757:X 07 Oct 11:20:18.086 # +new-epoch 5
1757:X 07 Oct 11:20:18.086 # +try-failover master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:20:18.090 # +vote-for-leader 49843b0d832503d8f31fc80f1439479ae8f26934 5
1757:X 07 Oct 11:20:18.090 # +elected-leader master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:20:18.090 # +failover-state-select-slave master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:20:18.154 # -failover-abort-no-good-slave master mymaster 10.211.55.100 6379
1757:X 07 Oct 11:20:18.254 # Next failover delay: I will not start a failover before Fri Oct 7 11:21:18 2016
The first two lines look good - it sees 101 come back up and tries to convert it to a slave, which it should. But then it apparently sees *itself* go down (which is odd, since it was running fine as master before 101 came back up), starts a new epoch, elects itself as leader (again, since it already was the master), aborts the failover with a no-good-slave error, and repeats indefinitely. At least, that's how I read the logs. What I can say for sure is that at this point *both* redis-server instances are marked as slaves, and there *is* no master.
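In other words, a replication check on both boxes at this point shows something like the following (illustrative - this is what I mean by both instances being slaves):

$ redis-cli -h 10.211.55.100 info replication | grep role:
role:slave
$ redis-cli -h 10.211.55.101 info replication | grep role:
role:slave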
The sentinel log on the other machine (101) shows a nearly identical pattern - except it seems to think 101 is still the master, and tries to convert 100 to a slave. (One thing I notice: the Sentinel ID reported here is identical to the one on 100 - I'm not sure whether that matters.)
1423:X 07 Oct 11:18:46.363 # Sentinel ID is 49843b0d832503d8f31fc80f1439479ae8f26934
1423:X 07 Oct 11:18:46.363 # +monitor master mymaster 10.211.55.101 6379 quorum 1
1423:X 07 Oct 11:18:56.372 * +convert-to-slave slave 10.211.55.100:6379 10.211.55.100 6379 @ mymaster 10.211.55.101 6379
1423:X 07 Oct 11:19:27.446 # +sdown master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:19:27.446 # +odown master mymaster 10.211.55.101 6379 #quorum 1/1
1423:X 07 Oct 11:19:27.446 # +new-epoch 3
1423:X 07 Oct 11:19:27.446 # +try-failover master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:19:27.463 # +vote-for-leader 49843b0d832503d8f31fc80f1439479ae8f26934 3
1423:X 07 Oct 11:19:27.463 # +elected-leader master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:19:27.463 # +failover-state-select-slave master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:19:27.519 # -failover-abort-no-good-slave master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:19:27.603 # Next failover delay: I will not start a failover before Fri Oct 7 11:20:28 2016
1423:X 07 Oct 11:20:28.382 # +new-epoch 4
1423:X 07 Oct 11:20:28.382 # +try-failover master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:20:28.385 # +vote-for-leader 49843b0d832503d8f31fc80f1439479ae8f26934 4
1423:X 07 Oct 11:20:28.385 # +elected-leader master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:20:28.385 # +failover-state-select-slave master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:20:28.452 # -failover-abort-no-good-slave master mymaster 10.211.55.101 6379
1423:X 07 Oct 11:20:28.507 # Next failover delay: I will not start a failover before Fri Oct 7 11:21:28 2016
So what's going on here? It looks like each sentinel converts the other machine's redis-server to a slave, both conversions succeed, and that leaves no master at all. Is redis-sentinel simply not reliable? Or am I doing something wrong? Keep in mind, all I've done above is shut down the master machine a couple of times. Any feedback would be appreciated.
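For completeness, this is roughly how redis-server and redis-sentinel are launched on each box (paths here are illustrative, not my exact ones):

# start redis-server with its config (the slave's config includes the slaveof line)
redis-server /etc/redis/redis.conf
# start a sentinel on the same machine
redis-sentinel /etc/redis/sentinel.conf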