Redis Sentinel and fix-slave-config problem: Redis node is getting set as slave of two masters when it should not be.

Nick F

Oct 22, 2015, 1:45:36 AM
to Redis DB
I'm trying to use Sentinel for failover in a large Redis fleet (12 sentinels; 500+ shards, each with one master and one slave). I'm encountering a very strange issue where my sentinels repeatedly emit +fix-slave-config against certain Redis nodes, with the result that certain slaves flip between the correct master and another, wrong master. I did not notice this happening at smaller scale, for what it is worth. Any advice on what to fix or how to debug further?

I've noticed two specific issues:
A) +fix-slave-config messages, as stated above.
B) The sentinel.conf shows certain slaves registered under two masters (each slave should have only one)

Part A)

The fleet in its starting state has a certain slave node XXX.XXX.XXX.177 with master XXX.XXX.XXX.244 (together they comprise shard 172 in the fleet). Without any node outages, the slave's master is switched to XXX.XXX.XXX.96 (the master of shard 188), then back, then forth again. I verified this by sshing into the slave and master nodes and checking redis-cli INFO. All Redis nodes started in the correct configuration. All Sentinel nodes had the correct configuration in their sentinel.conf. Each Sentinel reports the exact same list of masters when I query it after each of these slave->master changes.
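
Concretely, the checks look like this (a sketch; <sentinel-host> below is a placeholder for any of my 12 sentinels):

# On the slave: which master does it report?
redis-cli -h XXX.XXX.XXX.177 -p 6379 INFO replication | grep -E 'role|master_host'

# On each master: which slaves are attached?
redis-cli -h XXX.XXX.XXX.244 -p 6379 INFO replication | grep -E 'role|slave0'
redis-cli -h XXX.XXX.XXX.96 -p 6379 INFO replication | grep -E 'role|slave0'

# One sentinel's view of each shard:
redis-cli -h <sentinel-host> -p 26379 SENTINEL master shard-172
redis-cli -h <sentinel-host> -p 26379 SENTINEL master shard-188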

Across my 12 sentinels, the following is logged; roughly every minute another +fix-slave-config message is sent:

Sentinel #8: 20096:X 22 Oct 01:41:49.793 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #1: 9832:X 22 Oct 01:42:50.795 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-172 XXX.XXX.XXX.244 6379
Sentinel #6: 20528:X 22 Oct 01:43:52.458 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #10: 20650:X 22 Oct 01:43:52.464 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-188 XXX.XXX.XXX.96 6379
Sentinel #2: 20838:X 22 Oct 01:44:53.489 * +fix-slave-config slave XXX.XXX.XXX.177:6379 XXX.XXX.XXX.177 6379 @ shard-172 XXX.XXX.XXX.244 6379

Part B)

Here's the output of the SENTINEL MASTERS command. The strange thing is that shard-188 has two slaves when in fact it should have only one. The output looks the same whether XXX.XXX.XXX.177 is currently registered under shard-172 or shard-188.

Case 1) XXX.XXX.XXX.244 is master of XXX.XXX.XXX.177

183)  1) "name"
      2) "shard-172"
      3) "ip"
      4) "XXX.XXX.XXX.244"
      5) "port"
      6) "6379"
      7) "runid"
      8) "ca02da1f0002a25a880e6765aed306b1857ae2f7"
      9) "flags"
     10) "master"
     11) "pending-commands"
     12) "0"
     13) "last-ping-sent"
     14) "0"
     15) "last-ok-ping-reply"
     16) "14"
     17) "last-ping-reply"
     18) "14"
     19) "down-after-milliseconds"
     20) "30000"
     21) "info-refresh"
     22) "5636"
     23) "role-reported"
     24) "master"
     25) "role-reported-time"
     26) "17154406"
     27) "config-epoch"
     28) "0"
     29) "num-slaves"
     30) "1"
     31) "num-other-sentinels"
     32) "12"
     33) "quorum"
     34) "7"
     35) "failover-timeout"
     36) "60000"
     37) "parallel-syncs"
     38) "1"
72)  1) "name"
      2) "shard-188"
      3) "ip"
      4) "XXX.XXX.XXX.96"
      5) "port"
      6) "6379"
      7) "runid"
      8) "95cd3a457ef71fc91ff1a1c5a6d5d4496b266167"
      9) "flags"
     10) "master"
     11) "pending-commands"
     12) "0"
     13) "last-ping-sent"
     14) "0"
     15) "last-ok-ping-reply"
     16) "927"
     17) "last-ping-reply"
     18) "927"
     19) "down-after-milliseconds"
     20) "30000"
     21) "info-refresh"
     22) "5333"
     23) "role-reported"
     24) "master"
     25) "role-reported-time"
     26) "17154312"
     27) "config-epoch"
     28) "0"
     29) "num-slaves"
     30) "2"
     31) "num-other-sentinels"
     32) "12"
     33) "quorum"
     34) "7"
     35) "failover-timeout"
     36) "60000"
     37) "parallel-syncs"
     38) "1"

Case 2) XXX.XXX.XXX.96 is master of XXX.XXX.XXX.177

79)  1) "name"
      2) "shard-172"
      3) "ip"
      4) "XXX.XXX.XXX.244"
      5) "port"
      6) "6379"
      7) "runid"
      8) "ca02da1f0002a25a880e6765aed306b1857ae2f7"
      9) "flags"
     10) "master"
     11) "pending-commands"
     12) "0"
     13) "last-ping-sent"
     14) "0"
     15) "last-ok-ping-reply"
     16) "1012"
     17) "last-ping-reply"
     18) "1012"
     19) "down-after-milliseconds"
     20) "30000"
     21) "info-refresh"
     22) "1261"
     23) "role-reported"
     24) "master"
     25) "role-reported-time"
     26) "17059720"
     27) "config-epoch"
     28) "0"
     29) "num-slaves"
     30) "1"
     31) "num-other-sentinels"
     32) "12"
     33) "quorum"
     34) "7"
     35) "failover-timeout"
     36) "60000"
     37) "parallel-syncs"
     38) "1"
273)  1) "name"
      2) "shard-188"
      3) "ip"
      4) "XXX.XXX.XXX.96"
      5) "port"
      6) "6379"
      7) "runid"
      8) "95cd3a457ef71fc91ff1a1c5a6d5d4496b266167"
      9) "flags"
     10) "master"
     11) "pending-commands"
     12) "0"
     13) "last-ping-sent"
     14) "0"
     15) "last-ok-ping-reply"
     16) "886"
     17) "last-ping-reply"
     18) "886"
     19) "down-after-milliseconds"
     20) "30000"
     21) "info-refresh"
     22) "5762"
     23) "role-reported"
     24) "master"
     25) "role-reported-time"
     26) "17059758"
     27) "config-epoch"
     28) "0"
     29) "num-slaves"
     30) "2"
     31) "num-other-sentinels"
     32) "12"
     33) "quorum"
     34) "7"
     35) "failover-timeout"
     36) "60000"
     37) "parallel-syncs"
     38) "1"

My starting sentinel.conf for each sentinel is:

maxclients 20000
loglevel notice
logfile "/home/redis/logs/sentinel.log"
sentinel monitor shard-172 redis-b-172 6379 7
sentinel down-after-milliseconds shard-172 30000
sentinel failover-timeout shard-172 60000
sentinel parallel-syncs shard-172 1
....
sentinel monitor shard-188 redis-b-188 6379 7
sentinel down-after-milliseconds shard-188 30000
sentinel failover-timeout shard-188 60000
sentinel parallel-syncs shard-188 1
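
For reference, each monitor line follows the documented form, here with a quorum of 7 across the 12 sentinels:

sentinel monitor <master-name> <host> <port> <quorum>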

Here's the resulting sentinel.conf (the same on all sentinels) after a few minutes. Note the two slaves registered under shard-188:

sentinel monitor shard-172 XXX.XXX.XXX.244 6379 7
sentinel failover-timeout shard-172 60000
sentinel config-epoch shard-172 0
sentinel leader-epoch shard-172 0
sentinel known-slave shard-172 XXX.XXX.XXX.177 6379 <--- True slave of shard-172
sentinel known-sentinel shard-172 ...
...
sentinel monitor shard-188 XXX.XXX.XXX.96 6379 7
sentinel failover-timeout shard-188 60000
sentinel config-epoch shard-188 0
sentinel leader-epoch shard-188 0
sentinel known-slave shard-188 XXX.XXX.XXX.194 6379 <--- True slave of shard-188
sentinel known-slave shard-188 XXX.XXX.XXX.177 6379
sentinel known-sentinel shard-188 ... 
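
Querying any sentinel for each shard's slaves mirrors those known-slave lines (again, <sentinel-host> is a placeholder):

redis-cli -h <sentinel-host> -p 26379 SENTINEL slaves shard-172   <--- lists XXX.XXX.XXX.177
redis-cli -h <sentinel-host> -p 26379 SENTINEL slaves shard-188   <--- lists XXX.XXX.XXX.194 and XXX.XXX.XXX.177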

chilumb...@gmail.com

Nov 8, 2015, 10:35:51 PM
to Redis DB
Sentinel was only good for managing failover in a Redis master-slave setup. With Redis Cluster now out, Sentinel is not needed, as the cluster can handle failover operations on its own.

In my experience using Sentinel with the Redis Cluster, I noticed a lot of the issues you are facing. In some instances, Sentinel would somehow force a working master to become a slave of another master, messing up the cluster altogether. For some reason, Sentinel never gets a proper map of the Redis Cluster (so while the cluster has registered the change of a master following a failover, Sentinel might miss it), and, in its confusion, apparently starts closing client connections on the server and all its replicas.

So, bottom line: turn off Sentinel and I am sure your issue will go away.

The Baldguy

Nov 9, 2015, 12:07:11 PM
to Redis DB


On Sunday, November 8, 2015 at 9:35:51 PM UTC-6, chilumb...@gmail.com wrote:
Sentinel was only good for managing failover in a Redis master-slave setup. With Redis Cluster now out, Sentinel is not needed, as the cluster can handle failover operations on its own.

This is incorrect. It is true that you don't use Sentinel with a Redis Cluster setup, but it is not true that Sentinel is no longer needed. For everyone NOT using Cluster, Sentinel is still very much applicable.
 

In my experience using Sentinel with the Redis Cluster, I noticed a lot of the issues you are facing.

The OP has not specified they are using Cluster.

So, bottom line: turn off Sentinel and I am sure your issue will go away.

Unless they aren't running Cluster, in which case they have more problems.
 
On Thursday, October 22, 2015 at 1:45:36 PM UTC+8, Nick F wrote:
I'm trying to use Sentinel for failover in a large Redis fleet (12 sentinels; 500+ shards, each with one master and one slave). I'm encountering a very strange issue where my sentinels repeatedly emit +fix-slave-config against certain Redis nodes, with the result that certain slaves flip between the correct master and another, wrong master. I did not notice this happening at smaller scale, for what it is worth. Any advice on what to fix or how to debug further?

I've noticed two specific issues:
A) +fix-slave-config messages, as stated above.
B) The sentinel.conf shows certain slaves registered under two masters (each slave should have only one)

Part A)

The fleet in its starting state has a certain slave node XXX.XXX.XXX.177 with master XXX.XXX.XXX.244 (together they comprise shard 172 in the fleet). Without any node outages, the slave's master is switched to XXX.XXX.XXX.96 (the master of shard 188), then back, then forth again. I verified this by sshing into the slave and master nodes and checking redis-cli INFO. All Redis nodes started in the correct configuration. All Sentinel nodes had the correct configuration in their sentinel.conf. Each Sentinel reports the exact same list of masters when I query it after each of these slave->master changes.

This is what I call an "ant problem": you have two (or more) pods (master+slave) intermingled. You indicate this when you show that one of your pods has multiple slaves.
 
Specifically:
Here's the output of the SENTINEL MASTERS command. The strange thing is that shard-188 has two slaves when in fact it should have only one.

What you need to do is bring those pods down, configure them correctly, remove them from the Sentinels, then bring them back online and ensure they each have only the one correct slave. Once that is verified, add them back into the appropriate Sentinels.

Now, technically you could use SENTINEL RESET, but you'd be racing conflicting timing, so removing them from Sentinel is the way to go, IMO.
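
A rough sketch of that sequence, using the masked addresses from your post (<sentinel-host> stands for each of your 12 sentinels in turn; double-check every address against your own fleet before running anything):

# 1. On every sentinel, stop managing the two affected shards
redis-cli -h <sentinel-host> -p 26379 SENTINEL remove shard-172
redis-cli -h <sentinel-host> -p 26379 SENTINEL remove shard-188

# 2. Point each slave at its one correct master
redis-cli -h XXX.XXX.XXX.177 -p 6379 SLAVEOF XXX.XXX.XXX.244 6379
redis-cli -h XXX.XXX.XXX.194 -p 6379 SLAVEOF XXX.XXX.XXX.96 6379

# 3. Verify each master now reports exactly one connected slave
redis-cli -h XXX.XXX.XXX.244 -p 6379 INFO replication
redis-cli -h XXX.XXX.XXX.96 -p 6379 INFO replication

# 4. On every sentinel, re-add the shards
redis-cli -h <sentinel-host> -p 26379 SENTINEL monitor shard-172 XXX.XXX.XXX.244 6379 7
redis-cli -h <sentinel-host> -p 26379 SENTINEL monitor shard-188 XXX.XXX.XXX.96 6379 7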


chilumb...@gmail.com

Nov 10, 2015, 4:16:19 AM
to Redis DB
The Baldguy,

I did not say Sentinel is not needed for all cases of Redis setups. I said exactly what you just said: that it is not needed for the Redis Cluster like it is for the Redis master-slave setup. Below is what I said:

Sentinel was only good for managing failover in a Redis master-slave setup. With Redis Cluster now out, Sentinel is not needed, as the cluster can handle failover operations on its own.

