3 nodes of our Redis Cluster were suddenly kicked out of the cluster


Madhur Ahuja

May 21, 2022, 7:34:03 AM
to Redis DB
Hi Team

We recently had a production issue where 3 master nodes were suddenly kicked out of the cluster.

We are running Redis 3.2.4 with 9 masters and 9 slaves on AWS. Instance types: r4.xlarge and r5.xlarge.

This cluster is big enough to handle 100k IOPS.

On that day, when traffic increased to 30k IOPS, 3 nodes were kicked out of the cluster.

Here are the logs from multiple cluster nodes, merged and sorted by time:

19:10:37.612 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:10:37.802 # Cluster state changed: fail redis_10.200.5.228.log:1205:M
19:10:42.613 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:10:42.803 # Cluster state changed: ok redis_10.200.5.228.log:1205:M
19:10:50.734 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:10:55.742 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:11:05.036 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:11:10.645 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:11:17.481 # Cluster state changed: fail redis_10.200.100.99.log:34158:M
19:11:19.263 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:11:22.533 # Cluster state changed: ok redis_10.200.100.99.log:34158:M
19:11:26.488 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:11:27.280 # Cluster state changed: fail redis_10.200.5.117.log:16009:M
19:11:32.280 # Cluster state changed: ok redis_10.200.5.117.log:16009:M
19:11:34.838 # Cluster state changed: fail redis_10.200.100.72.log:22084:M
19:11:34.975 # Cluster state changed: fail redis_10.200.100.99.log:34158:M
19:11:36.088 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:11:40.009 # Cluster state changed: ok redis_10.200.100.99.log:34158:M
19:11:41.078 # Cluster state changed: ok redis_10.200.100.72.log:22084:M
19:11:41.823 * Marking node 5440860bf0c56106f22384be3bc7f74eccc318ca as failing (quorum reached). redis_10.200.2.171.log:1200:S
19:11:41.824 # Cluster state changed: fail redis_10.200.2.171.log:1200:S
19:11:41.877 # Start of election delayed for 602 milliseconds (rank #0, offset 2753804932435). redis_10.200.2.171.log:1200:S
19:11:42.454 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:11:42.479 # Starting a failover election for epoch 89. redis_10.200.2.171.log:1200:S
19:11:42.523 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.85.log:21636:M
19:11:43.187 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.2.235.log:1200:M
19:11:43.254 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.5.117.log:16009:M
19:11:43.881 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.58.log:22037:M
19:11:44.231 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.5.228.log:1205:M
19:11:44.387 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.101.90.log:35522:M
19:11:44.421 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.194.log:10116:M
19:11:44.988 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.99.log:34158:M
19:11:46.954 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.72.log:22084:M
19:11:47.899 # Cluster state changed: fail redis_10.200.5.228.log:1205:M
19:11:48.899 # Cluster state changed: fail redis_10.200.100.187.log:22016:S
19:11:48.951 # Cluster state changed: fail redis_10.200.101.179.log:2261:S
19:11:49.067 # Cluster state changed: fail redis_10.200.100.99.log:34158:M
19:11:49.518 # Cluster state changed: ok redis_10.200.101.179.log:2261:S
19:11:50.074 # Cluster state changed: ok redis_10.200.100.187.log:22016:S
19:11:50.083 # Cluster state changed: fail redis_10.200.100.72.log:22084:M
19:11:51.455 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:11:51.643 * Marking node a8f9373511a75b9c97c1a6ffe017c31ca8725a48 as failing (quorum reached). redis_10.200.100.55.log:21922:S
19:11:51.644 # Cluster state changed: fail redis_10.200.100.55.log:21922:S
19:11:51.893 # Currently unable to failover: Waiting for votes, but majority still not reached. redis_10.200.2.171.log:1200:S
19:11:52.006 # Cluster state changed: fail redis_10.200.101.90.log:35522:M
19:11:52.495 # Currently unable to failover: Failover attempt expired. redis_10.200.2.171.log:1200:S
19:11:52.903 # Cluster state changed: ok redis_10.200.5.228.log:1205:M
19:11:53.038 # Cluster state changed: fail redis_10.200.100.58.log:22037:M
19:11:53.038 * Marking node 5440860bf0c56106f22384be3bc7f74eccc318ca as failing (quorum reached). redis_10.200.100.58.log:22037:M
19:11:53.040 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.101.90.log:35522:M
19:11:53.140 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.85.log:21636:M
19:11:53.141 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.101.8.log:8114:S
19:11:53.142 # Cluster state changed: fail redis_10.200.101.8.log:8114:S
19:11:53.146 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.88.log:21852:S
19:11:53.147 # Cluster state changed: fail redis_10.200.100.88.log:21852:S
19:11:53.152 # Cluster state changed: fail redis_10.200.3.31.log:15815:S
19:11:53.152 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.99.log:34158:M
19:11:53.152 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.3.31.log:15815:S
19:11:53.155 # Cluster state changed: fail redis_10.200.101.179.log:2261:S
19:11:53.155 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.72.log:22084:M
19:11:53.155 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.101.179.log:2261:S
19:11:53.157 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.55.log:21922:S
19:11:53.158 # Cluster state changed: fail redis_10.200.5.91.log:15823:S
19:11:53.158 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.5.91.log:15823:S
19:11:53.165 # Cluster state changed: fail redis_10.200.100.187.log:22016:S
19:11:53.165 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.187.log:22016:S
19:11:53.167 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.49.log:646447:S
19:11:53.168 # Cluster state changed: fail redis_10.200.100.49.log:646447:S
19:11:53.589 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.2.235.log:1200:M
19:11:53.590 # Cluster state changed: fail redis_10.200.2.235.log:1200:M
19:11:54.844 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.2.171.log:1200:S
19:11:54.844 # Cluster state changed: ok redis_10.200.2.171.log:1200:S
19:11:55.471 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.5.228.log:1205:M
19:11:55.514 # Cluster state changed: fail redis_10.200.5.228.log:1205:M
19:12:02.424 * Clear FAIL state for node a8f9373511a75b9c97c1a6ffe017c31ca8725a48: is reachable again and nobody is serving its slots after some time. redis_10.200.100.55.log:21922:S
19:12:03.212 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.101.179.log:2261:S
19:12:03.212 # Cluster state changed: ok redis_10.200.101.179.log:2261:S
19:12:03.218 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.3.31.log:15815:S
19:12:03.219 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.5.91.log:15823:S
19:12:03.219 # Cluster state changed: ok redis_10.200.3.31.log:15815:S
19:12:03.220 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.100.187.log:22016:S
19:12:03.220 # Cluster state changed: ok redis_10.200.100.187.log:22016:S
19:12:03.220 # Cluster state changed: ok redis_10.200.5.91.log:15823:S
19:12:03.223 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.100.55.log:21922:S
19:12:03.223 # Cluster state changed: ok redis_10.200.100.55.log:21922:S
19:12:04.637 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.100.99.log:34158:M
19:12:04.638 # Cluster state changed: ok redis_10.200.100.99.log:34158:M
19:12:04.639 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.100.72.log:22084:M
19:12:04.639 # Cluster state changed: ok redis_10.200.100.72.log:22084:M
19:12:04.641 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.2.235.log:1200:M
19:12:04.648 # Cluster state changed: ok redis_10.200.2.235.log:1200:M
19:12:04.752 # Cluster state changed: fail redis_10.200.2.171.log:1200:S
19:12:04.752 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.2.171.log:1200:S
19:12:04.754 * Marking node 5440860bf0c56106f22384be3bc7f74eccc318ca as failing (quorum reached). redis_10.200.100.194.log:10116:M
19:12:04.762 # Cluster state changed: fail redis_10.200.100.194.log:10116:M
19:12:04.769 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.5.91.log:15823:S
19:12:04.770 # Cluster state changed: fail redis_10.200.5.91.log:15823:S
19:12:04.817 # Start of election delayed for 754 milliseconds (rank #0, offset 2753806750215). redis_10.200.2.171.log:1200:S
19:12:04.870 # Cluster state changed: fail redis_10.200.101.179.log:2261:S
19:12:04.870 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.101.179.log:2261:S
19:12:04.875 # Cluster state changed: fail redis_10.200.100.55.log:21922:S
19:12:04.875 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.55.log:21922:S
19:12:04.880 # Cluster state changed: fail redis_10.200.100.187.log:22016:S
19:12:04.880 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.187.log:22016:S
19:12:04.881 # Cluster state changed: fail redis_10.200.3.31.log:15815:S
19:12:04.881 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.3.31.log:15815:S
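
To make the flapping in the log above easier to see, the merged lines can be summarized per node. The following is only a minimal sketch: it assumes every line follows the `HH:MM:SS.mmm <marker> <message> redis_<ip>.log:<pid>:<role>` layout shown above, and the names `parse` and `fail_windows` are made up for illustration. It pairs each "Cluster state changed: fail" with the next "Cluster state changed: ok" from the same node and reports how long that node considered the cluster down.

```python
import re
from collections import defaultdict
from datetime import datetime

# Layout assumed from the merged log above:
#   <time> <#|*> <message> redis_<ip>.log:<pid>:<M|S>
LINE_RE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d{3}) [#*] (?P<msg>.*) "
    r"redis_(?P<ip>[\d.]+)\.log:(?P<pid>\d+):(?P<role>[MS])$"
)


def parse(line):
    """Return (timestamp, ip, role, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%H:%M:%S.%f")
    return ts, m["ip"], m["role"], m["msg"]


def fail_windows(lines):
    """Map each node IP to the durations (in seconds) of its fail -> ok windows."""
    open_fail = {}                # ip -> time of the first unmatched "fail"
    windows = defaultdict(list)
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        ts, ip, _role, msg = parsed
        if msg == "Cluster state changed: fail":
            open_fail.setdefault(ip, ts)
        elif msg == "Cluster state changed: ok" and ip in open_fail:
            windows[ip].append((ts - open_fail.pop(ip)).total_seconds())
    return dict(windows)
```

Feeding it the lines of the merged log (for example, read from a text file in the same format) gives one list of outage durations per node, which shows at a glance which nodes flapped most often and for how long. A window that never returns to "ok" within the log is simply not reported.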


CPU and memory usage were not high at that time.

Does anyone know of a known issue in this Redis version? We run several clusters on later Redis versions and do not see any issues there, even at much higher workloads.

Regards,
Madhur
humanReadable.txt