Conflicting masters in Redis cluster using Sentinel

Saurav Mondal

Sep 11, 2019, 8:16:54 AM
to Redis DB
Hi,

I have configured a 3-node Redis (version 4.0.11) cluster using Sentinel, with a Redis process and a Sentinel running on each VM. During a system upgrade, I observed two nodes claiming to be master at the same time.
The sentinel config file is attached, and the logs of the two nodes concerned are given below. Could you please let me know if I am doing anything wrong in the Sentinel config? Or do you see any config-related problem in the logs?
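For reference, the config follows the usual shape below; the quorum of 2 matches the +monitor line in the Sentinel log further down, but the timeout values shown here are only illustrative defaults, not necessarily our exact settings:

sentinel monitor mymaster 10.11.42.168 6379 2
# the values below are Sentinel's defaults, shown for illustration only
sentinel down-after-milliseconds mymaster 30000
sentinel failover-timeout mymaster 180000
sentinel parallel-syncs mymaster 1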

Redis log of 10.11.42.168:

4570:S 09 Jul 08:55:57.375 * Connecting to MASTER 10.11.42.170:6379
4570:S 09 Jul 08:55:57.375 * MASTER <-> SLAVE sync started
4570:S 09 Jul 08:55:57.375 * Non blocking connect for SYNC fired the event.
4570:S 09 Jul 08:55:57.375 * Master replied to PING, replication can continue...
4570:S 09 Jul 08:55:57.376 * Trying a partial resynchronization (request 1edf139c5ae3723c8e62dca9dc3bc20bfc068ff8:1254671132).
4570:S 09 Jul 08:55:57.377 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
4570:M 09 Jul 08:55:57.556 # Setting secondary replication ID to 1edf139c5ae3723c8e62dca9dc3bc20bfc068ff8, valid up to offset: 1254671132. New replication ID is 21395820adc14008568a7c46098f46a15a970c23
4570:M 09 Jul 08:55:57.556 * Discarding previously cached master state.
4570:M 09 Jul 08:55:57.556 * MASTER MODE enabled (user request from 'id=2300 addr=10.11.42.168:42334 fd=21 name= age=0 idle=0 flags=N db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=32768 obl=0 oll=0 omem=0 events=r cmd=slaveof')
4570:M 09 Jul 08:55:57.711 * Slave 10.11.42.169:6379 asks for synchronization
4570:M 09 Jul 08:55:57.711 * Partial resynchronization request from 10.11.42.169:6379 accepted. Sending 0 bytes of backlog starting from offset 1254671132.
4570:M 09 Jul 08:55:58.008 * Slave 10.11.42.170:6379 asks for synchronization
4570:M 09 Jul 08:55:58.008 * Partial resynchronization request from 10.11.42.170:6379 accepted. Sending 141 bytes of backlog starting from offset 1254670991.
4570:M 09 Jul 08:55:59.874 # Connection with slave 10.11.42.169:6379 lost.
4570:S 09 Jul 08:56:00.527 # Connection with slave 10.11.42.170:6379 lost.
4570:S 09 Jul 08:56:00.527 * Before turning into a slave, using my master parameters to synthesize a cached master: I may be able to synchronize with the new master with just a partial transfer.
4570:S 09 Jul 08:56:00.527 * SLAVE OF 10.11.42.169:6379 enabled (user request from 'id=23 addr=10.11.42.168:36468 fd=12 name=sentinel-3af8da37-cmd age=1671 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=141 qbuf-free=32627 obl=36 oll=0 omem=0 events=r cmd=exec')
4570:S 09 Jul 08:56:00.528 # CONFIG REWRITE executed with success.
4570:S 09 Jul 08:56:01.384 * Connecting to MASTER 10.11.42.169:6379
4570:S 09 Jul 08:56:01.384 * MASTER <-> SLAVE sync started
4570:S 09 Jul 08:56:01.384 * Non blocking connect for SYNC fired the event.


Redis log of 10.11.42.170:

4571:S 09 Jul 08:55:57.004 * Connecting to MASTER 10.11.42.168:6379
4571:S 09 Jul 08:55:57.004 * MASTER <-> SLAVE sync started
4571:S 09 Jul 08:55:57.004 * Non blocking connect for SYNC fired the event.
4571:S 09 Jul 08:55:57.004 * Master replied to PING, replication can continue...
4571:S 09 Jul 08:55:57.005 * Trying a partial resynchronization (request 1edf139c5ae3723c8e62dca9dc3bc20bfc068ff8:1254670991).
4571:S 09 Jul 08:55:57.005 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master
4571:S 09 Jul 08:55:58.006 * Connecting to MASTER 10.11.42.168:6379
4571:S 09 Jul 08:55:58.006 * MASTER <-> SLAVE sync started
4571:S 09 Jul 08:55:58.006 * Non blocking connect for SYNC fired the event.
4571:S 09 Jul 08:55:58.007 * Master replied to PING, replication can continue...
4571:S 09 Jul 08:55:58.007 * Trying a partial resynchronization (request 1edf139c5ae3723c8e62dca9dc3bc20bfc068ff8:1254670991).
4571:S 09 Jul 08:55:58.008 * Successful partial resynchronization with master.
4571:S 09 Jul 08:55:58.008 # Master replication ID changed to 21395820adc14008568a7c46098f46a15a970c23
4571:S 09 Jul 08:55:58.008 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
4571:S 09 Jul 08:56:00.527 # Connection with master lost.
4571:S 09 Jul 08:56:00.527 * Caching the disconnected master state.
4571:S 09 Jul 08:56:01.011 * Connecting to MASTER 10.11.42.168:6379
4571:S 09 Jul 08:56:01.011 * MASTER <-> SLAVE sync started
4571:S 09 Jul 08:56:01.012 * Non blocking connect for SYNC fired the event.
4571:S 09 Jul 08:56:01.012 * Master replied to PING, replication can continue...
4571:S 09 Jul 08:56:01.013 * Trying a partial resynchronization (request 21395820adc14008568a7c46098f46a15a970c23:1254671719).
4571:S 09 Jul 08:56:01.013 * Master is currently unable to PSYNC but should be in the future: -NOMASTERLINK Can't SYNC while not connected with my master


Sentinel log:

4711:X 09 Jul 08:55:15.395 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
4711:X 09 Jul 08:55:15.395 # Redis version=4.0.11, bits=64, commit=00000000, modified=0, pid=4711, just started
4711:X 09 Jul 08:55:15.395 # Configuration loaded
4712:X 09 Jul 08:55:15.397 * Increased maximum number of open files to 10032 (it was originally set to 1024).
4712:X 09 Jul 08:55:15.398 # Not listening to IPv6: unsupproted
4712:X 09 Jul 08:55:15.399 * Running mode=sentinel, port=26379.
4712:X 09 Jul 08:55:15.399 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
4712:X 09 Jul 08:55:15.401 # Sentinel ID is 1ef665babd98ea2e37836dc200a8a537ae3948ea
4712:X 09 Jul 08:55:15.401 # +monitor master mymaster 10.11.42.168 6379 quorum 2
4712:X 09 Jul 08:55:15.536 * +sentinel sentinel c4ba2e747ab7f3b596901fbae6c6fd47fe2ff724 10.11.42.170 26379 @ mymaster 10.11.42.168 6379
4712:X 09 Jul 08:55:15.540 # +new-epoch 66
4712:X 09 Jul 08:55:15.540 # +config-update-from sentinel c4ba2e747ab7f3b596901fbae6c6fd47fe2ff724 10.11.42.170 26379 @ mymaster 10.11.42.168 6379
4712:X 09 Jul 08:55:15.540 # +switch-master mymaster 10.11.42.168 6379 10.11.42.170 6379
4712:X 09 Jul 08:55:15.540 * +slave slave 10.11.42.168:6379 10.11.42.168 6379 @ mymaster 10.11.42.170 6379
4712:X 09 Jul 08:55:15.887 * +sentinel sentinel 3af8da37608533f2b0247a69afecbefef714ca38 10.11.42.168 26379 @ mymaster 10.11.42.170 6379
4712:X 09 Jul 08:55:29.375 # +new-epoch 67
4712:X 09 Jul 08:55:29.377 # +vote-for-leader 3af8da37608533f2b0247a69afecbefef714ca38 67
4712:X 09 Jul 08:55:52.490 # +sdown master mymaster 10.11.42.170 6379
4712:X 09 Jul 08:55:52.556 # +odown master mymaster 10.11.42.170 6379 #quorum 3/2
4712:X 09 Jul 08:55:52.556 # Next failover delay: I will not start a failover before Tue Jul 9 08:55:59 2019
4712:X 09 Jul 08:55:58.008 * +sentinel-invalid-addr sentinel c4ba2e747ab7f3b596901fbae6c6fd47fe2ff724 10.11.42.170 26379 @ mymaster 10.11.42.170 6379
4712:X 09 Jul 08:55:58.008 * +sentinel sentinel 156b9cf1623563ffdd2037b96eb23961b489654b 10.11.42.170 26379 @ mymaster 10.11.42.170 6379
4712:X 09 Jul 08:55:58.202 * +sentinel-address-switch master mymaster 10.11.42.170 6379 ip 10.11.42.170 port 26379 for c4ba2e747ab7f3b596901fbae6c6fd47fe2ff724
4712:X 09 Jul 08:55:59.574 # +new-epoch 68
4712:X 09 Jul 08:55:59.576 # +vote-for-leader 3af8da37608533f2b0247a69afecbefef714ca38 68
4712:X 09 Jul 08:55:59.610 # Next failover delay: I will not start a failover before Tue Jul 9 08:56:30 2019
4712:X 09 Jul 08:56:00.527 # +config-update-from sentinel 3af8da37608533f2b0247a69afecbefef714ca38 10.11.42.168 26379 @ mymaster 10.11.42.170 6379
4712:X 09 Jul 08:56:00.527 # +switch-master mymaster 10.11.42.170 6379 10.11.42.169 6379
4712:X 09 Jul 08:56:00.527 * +slave slave 10.11.42.168:6379 10.11.42.168 6379 @ mymaster 10.11.42.169 6379
4712:X 09 Jul 08:56:00.527 * +slave slave 10.11.42.170:6379 10.11.42.170 6379 @ mymaster 10.11.42.169 6379


Attachment: sentinel.conf

Greg Andrews

Sep 11, 2019, 10:29:09 AM
to Redis DB
You have multiple Sentinels defined, but you show the log for only one of them. That log suggests a loss of connectivity from some of the Sentinels to the designated master, so a failover was requested. Have you compared the logs of all the Sentinels to confirm or rule out that some of them have trouble talking to the master while others don't?
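A quick way to compare their runtime views (a sketch using the addresses from your logs; adjust as needed) is to ask each Sentinel which master it believes in, and each Redis instance which role it reports:

redis-cli -h 10.11.42.168 -p 26379 SENTINEL get-master-addr-by-name mymaster
redis-cli -h 10.11.42.169 -p 26379 SENTINEL get-master-addr-by-name mymaster
redis-cli -h 10.11.42.170 -p 26379 SENTINEL get-master-addr-by-name mymaster
# on each Redis instance, the role: field shows master vs. slave
redis-cli -h 10.11.42.168 -p 6379 INFO replication

If the three Sentinels return different addresses while the problem is occurring, that would confirm the split view.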

Saurav Mondal

Nov 28, 2019, 4:57:11 AM
to Redis DB
Hi Greg,

I have attached all the logs from all three nodes, including the Sentinel logs. Starting from around 2019-05-28T08:36:30, there appears to be a conflict in the leader election: the Sentinels cannot decide for about 20 minutes, and "Next failover delay: I will not start a failover before ..." is printed repeatedly by all of them. Then, at the end, around 2019-05-28T08:56:30, 10.11.42.170 becomes the new master, after roughly 20 minutes of downtime. Do you see any problem in our Sentinel configuration? Do you need any other information for the analysis?
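One thing that makes me suspicious is the +sentinel-invalid-addr line in the earlier log: a Sentinel restarted during the upgrade seems to have come back with a new ID, and I wonder whether stale peer entries could be blocking the election. If it would help, this is roughly what I would run on each node to check (a sketch, not yet tried here):

redis-cli -p 26379 SENTINEL ckquorum mymaster
# list the peers this Sentinel knows about; duplicated or stale IDs inflate the majority count
redis-cli -p 26379 SENTINEL sentinels mymaster
# if stale entries show up, they can be cleared with a reset
redis-cli -p 26379 SENTINEL reset mymaster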

Best regards,
Saurav
Attachments: 10.11.42.170_sentinel, 10.11.42.170_redis, 10.11.42.169_sentinel, 10.11.42.169_redis, 10.11.42.168_redis, 10.11.42.168_sentinel

Ankit Gupta

Aug 5, 2022, 8:03:24 AM
to Redis DB
Hi,

Were you able to find the root cause of this issue?
