Redis with IPsec, sentinels do not promote slave when IPsec is flushed


Rob

Apr 5, 2016, 9:44:03 AM
to Redis DB
I have three Redis servers running Redis 3.0.5, and each one is also running a Redis sentinel. The communication between the servers is encrypted with IPsec. When all IPsec connections are working, stopping the Redis replication master is recognised by the sentinels and they promote a slave to be the new master.

But when I flush IPsec (using setkey -F and setkey -FP), the communication between the master and the slaves is broken, yet the sentinels do NOT promote a slave to become the new master.
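
For reference, the flush is roughly the following (setkey from ipsec-tools; -F flushes the security associations and -FP flushes the policies):

setkey -F
setkey -FP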

The Redis slaves do recognise that their master appears to be unavailable, because I can see entries like this in the Redis server log:

11002:S 04 Apr 15:49:49.258 * Connecting to MASTER 192.168.106.246:6379
11002:S 04 Apr 15:49:49.259 * MASTER <-> SLAVE sync started
11002:S 04 Apr 15:50:50.425 # Timeout connecting to the MASTER...


If I subsequently activate IPsec again on the Redis master, the replication catches up and the master remains the master.

But while IPsec is flushed, updates can still happen on the master and they are not replicated to the slaves.

Is there a setting for the sentinels that will make them recognise that the master is down?
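
For context, each sentinel monitors the master with a quorum of 2, along these lines (the timeout values shown here are just the defaults, for illustration):

sentinel monitor mymaster 192.168.106.246 6379 2
sentinel down-after-milliseconds mymaster 30000
sentinel failover-timeout mymaster 180000
sentinel parallel-syncs mymaster 1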

Jan-Erik Rediger

Apr 5, 2016, 9:51:24 AM
to redi...@googlegroups.com
Can you show the log of your sentinels?
Do the sentinels see the slaves as down?
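
You can dump each sentinel's current view of the setup with something like this (adjust the port and master name to your config):

redis-cli -p 26379 SENTINEL masters
redis-cli -p 26379 SENTINEL slaves mymaster
redis-cli -p 26379 SENTINEL sentinels mymaster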

Rob

Apr 5, 2016, 10:49:01 AM
to Redis DB
This appears in the sentinel log when I flush IPsec on the master:

2346:X 05 Apr 15:44:19.441 # +sdown sentinel 192.168.106.246:26379 192.168.106.246 26379 @ mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:19.503 # +sdown master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:20.124 # +new-epoch 279
2346:X 05 Apr 15:44:20.126 # +vote-for-leader dde9619555329198dfc948eb64a20aea9f6c7a8e 279
2346:X 05 Apr 15:44:20.634 # +odown master mymaster 192.168.106.246 6379 #quorum 2/2
2346:X 05 Apr 15:44:20.634 # Next failover delay: I will not start a failover before Tue Apr  5 15:44:24 2016
2346:X 05 Apr 15:44:24.621 # +new-epoch 280
2346:X 05 Apr 15:44:24.624 # +vote-for-leader dde9619555329198dfc948eb64a20aea9f6c7a8e 280
2346:X 05 Apr 15:44:24.626 # Next failover delay: I will not start a failover before Tue Apr  5 15:44:29 2016
2346:X 05 Apr 15:44:29.479 # +new-epoch 281
2346:X 05 Apr 15:44:29.482 # +vote-for-leader dde9619555329198dfc948eb64a20aea9f6c7a8e 281
2346:X 05 Apr 15:44:29.484 # Next failover delay: I will not start a failover before Tue Apr  5 15:44:33 2016
2346:X 05 Apr 15:44:33.946 # +new-epoch 282
2346:X 05 Apr 15:44:33.947 # +try-failover master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:33.950 # +vote-for-leader f07d97e2b78c9dfe6e1d28a5590ecff43b443428 282
2346:X 05 Apr 15:44:33.957 # 192.168.106.251:26379 voted for f07d97e2b78c9dfe6e1d28a5590ecff43b443428 282
2346:X 05 Apr 15:44:35.982 # -failover-abort-not-elected master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:36.034 # Next failover delay: I will not start a failover before Tue Apr  5 15:44:37 2016
2346:X 05 Apr 15:44:37.980 # +new-epoch 283
2346:X 05 Apr 15:44:37.980 # +try-failover master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:37.984 # +vote-for-leader f07d97e2b78c9dfe6e1d28a5590ecff43b443428 283
2346:X 05 Apr 15:44:37.991 # 192.168.106.251:26379 voted for f07d97e2b78c9dfe6e1d28a5590ecff43b443428 283
2346:X 05 Apr 15:44:40.261 # -failover-abort-not-elected master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:40.345 # Next failover delay: I will not start a failover before Tue Apr  5 15:44:42 2016
2346:X 05 Apr 15:44:42.240 # +new-epoch 284
2346:X 05 Apr 15:44:42.240 # +try-failover master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:42.244 # +vote-for-leader f07d97e2b78c9dfe6e1d28a5590ecff43b443428 284
2346:X 05 Apr 15:44:42.251 # 192.168.106.251:26379 voted for f07d97e2b78c9dfe6e1d28a5590ecff43b443428 284
2346:X 05 Apr 15:44:44.891 # -failover-abort-not-elected master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:44.975 # Next failover delay: I will not start a failover before Tue Apr  5 15:44:46 2016
2346:X 05 Apr 15:44:46.886 # +new-epoch 285
2346:X 05 Apr 15:44:46.889 # +vote-for-leader dde9619555329198dfc948eb64a20aea9f6c7a8e 285
2346:X 05 Apr 15:44:46.902 # Next failover delay: I will not start a failover before Tue Apr  5 15:44:51 2016
2346:X 05 Apr 15:44:51.273 # +new-epoch 286
2346:X 05 Apr 15:44:51.276 # +vote-for-leader dde9619555329198dfc948eb64a20aea9f6c7a8e 286
2346:X 05 Apr 15:44:51.337 # Next failover delay: I will not start a failover before Tue Apr  5 15:44:55 2016


Then I set up the IPsec policies again, and the old master was available again. Both slaves then seem to vote for it to become master again:

2346:X 05 Apr 15:44:55.289 # +new-epoch 287
2346:X 05 Apr 15:44:55.290 # +try-failover master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:55.293 # +vote-for-leader f07d97e2b78c9dfe6e1d28a5590ecff43b443428 287
2346:X 05 Apr 15:44:55.300 # 192.168.106.251:26379 voted for f07d97e2b78c9dfe6e1d28a5590ecff43b443428 287
2346:X 05 Apr 15:44:55.731 # 192.168.106.246:26379 voted for f07d97e2b78c9dfe6e1d28a5590ecff43b443428 287
2346:X 05 Apr 15:44:55.776 # +elected-leader master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:55.776 # +failover-state-select-slave master mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:55.776 # -sdown sentinel 192.168.106.246:26379 192.168.106.246 26379 @ mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:55.842 # +selected-slave slave 192.168.106.247:6379 192.168.106.247 6379 @ mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:55.843 * +failover-state-send-slaveof-noone slave 192.168.106.247:6379 192.168.106.247 6379 @ mymaster 192.168.106.246 6379
2346:X 05 Apr 15:44:55.895 * +failover-state-wait-promotion slave 192.168.106.247:6379 192.168.106.247 6379 @ mymaster 192.168.106.246 6379

The Real Bill

Apr 6, 2016, 1:35:07 AM
to Redis DB
The sentinels still have to be able to talk to each other. By removing this ability you are removing their ability to elect a leader to perform a failover. You can see this in the log, where it reports that the failover had to be aborted because the sentinel wasn't elected leader.

I would recommend trying this when the sentinels are not on the same nodes as the Redis servers they manage. You may also be hitting a bug known to exist when running sentinels on the Redis nodes.
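
With three sentinels, a leader election needs a majority of 2 that can actually reach each other, not just agreement that the master is down. If your version supports it, you can ask each sentinel whether it currently sees enough peers to authorize a failover:

redis-cli -p 26379 SENTINEL ckquorum mymaster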

Rob

Apr 11, 2016, 9:32:10 AM
to Redis DB
Thanks. I added another sentinel and this solved the problem. I expect that when one of the original sentinels became inaccessible, the system was reduced to two sentinels, which the sentinel documentation specifically warns against. So once the extra sentinel was added, three out of the four could vote up a slave to become the new replication master.
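
(For anyone else testing this: you can check which node the sentinels currently consider the master with

redis-cli -p 26379 SENTINEL get-master-addr-by-name mymaster

which returns the IP and port of the current master.)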


Incidentally, what is the "bug known to exist when running sentinels on the Redis nodes" that you mention?