We recently had a production issue where 3 master nodes were suddenly kicked out of the cluster.
We are running Redis 3.2.4 with 9 masters and 9 slaves on AWS. Instance types: r4.xlarge and r5.xlarge.
This cluster is sized to handle our 100k IOPS.
On that day, when traffic increased to 30k IOPS (well below that capacity), 3 nodes were kicked out of the cluster:
19:10:37.612 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:10:37.802 # Cluster state changed: fail redis_10.200.5.228.log:1205:M
19:10:42.613 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:10:42.803 # Cluster state changed: ok redis_10.200.5.228.log:1205:M
19:10:50.734 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:10:55.742 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:11:05.036 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:11:10.645 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:11:17.481 # Cluster state changed: fail redis_10.200.100.99.log:34158:M
19:11:19.263 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:11:22.533 # Cluster state changed: ok redis_10.200.100.99.log:34158:M
19:11:26.488 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:11:27.280 # Cluster state changed: fail redis_10.200.5.117.log:16009:M
19:11:32.280 # Cluster state changed: ok redis_10.200.5.117.log:16009:M
19:11:34.838 # Cluster state changed: fail redis_10.200.100.72.log:22084:M
19:11:34.975 # Cluster state changed: fail redis_10.200.100.99.log:34158:M
19:11:36.088 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:11:40.009 # Cluster state changed: ok redis_10.200.100.99.log:34158:M
19:11:41.078 # Cluster state changed: ok redis_10.200.100.72.log:22084:M
19:11:41.823 * Marking node 5440860bf0c56106f22384be3bc7f74eccc318ca as failing (quorum reached). redis_10.200.2.171.log:1200:S
19:11:41.824 # Cluster state changed: fail redis_10.200.2.171.log:1200:S
19:11:41.877 # Start of election delayed for 602 milliseconds (rank #0, offset 2753804932435). redis_10.200.2.171.log:1200:S
19:11:42.454 # Cluster state changed: ok redis_10.200.100.85.log:21636:M
19:11:42.479 # Starting a failover election for epoch 89. redis_10.200.2.171.log:1200:S
19:11:42.523 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.85.log:21636:M
19:11:43.187 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.2.235.log:1200:M
19:11:43.254 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.5.117.log:16009:M
19:11:43.881 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.58.log:22037:M
19:11:44.231 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.5.228.log:1205:M
19:11:44.387 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.101.90.log:35522:M
19:11:44.421 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.194.log:10116:M
19:11:44.988 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.99.log:34158:M
19:11:46.954 # Failover auth denied to 127fc5b043bd75068aebd810c40673cb514859f7: its master is up redis_10.200.100.72.log:22084:M
19:11:47.899 # Cluster state changed: fail redis_10.200.5.228.log:1205:M
19:11:48.899 # Cluster state changed: fail redis_10.200.100.187.log:22016:S
19:11:48.951 # Cluster state changed: fail redis_10.200.101.179.log:2261:S
19:11:49.067 # Cluster state changed: fail redis_10.200.100.99.log:34158:M
19:11:49.518 # Cluster state changed: ok redis_10.200.101.179.log:2261:S
19:11:50.074 # Cluster state changed: ok redis_10.200.100.187.log:22016:S
19:11:50.083 # Cluster state changed: fail redis_10.200.100.72.log:22084:M
19:11:51.455 # Cluster state changed: fail redis_10.200.100.85.log:21636:M
19:11:51.643 * Marking node a8f9373511a75b9c97c1a6ffe017c31ca8725a48 as failing (quorum reached). redis_10.200.100.55.log:21922:S
19:11:51.644 # Cluster state changed: fail redis_10.200.100.55.log:21922:S
19:11:51.893 # Currently unable to failover: Waiting for votes, but majority still not reached. redis_10.200.2.171.log:1200:S
19:11:52.006 # Cluster state changed: fail redis_10.200.101.90.log:35522:M
19:11:52.495 # Currently unable to failover: Failover attempt expired. redis_10.200.2.171.log:1200:S
19:11:52.903 # Cluster state changed: ok redis_10.200.5.228.log:1205:M
19:11:53.038 # Cluster state changed: fail redis_10.200.100.58.log:22037:M
19:11:53.038 * Marking node 5440860bf0c56106f22384be3bc7f74eccc318ca as failing (quorum reached). redis_10.200.100.58.log:22037:M
19:11:53.040 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.101.90.log:35522:M
19:11:53.140 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.85.log:21636:M
19:11:53.141 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.101.8.log:8114:S
19:11:53.142 # Cluster state changed: fail redis_10.200.101.8.log:8114:S
19:11:53.146 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.88.log:21852:S
19:11:53.147 # Cluster state changed: fail redis_10.200.100.88.log:21852:S
19:11:53.152 # Cluster state changed: fail redis_10.200.3.31.log:15815:S
19:11:53.152 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.99.log:34158:M
19:11:53.152 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.3.31.log:15815:S
19:11:53.155 # Cluster state changed: fail redis_10.200.101.179.log:2261:S
19:11:53.155 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.72.log:22084:M
19:11:53.155 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.101.179.log:2261:S
19:11:53.157 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.55.log:21922:S
19:11:53.158 # Cluster state changed: fail redis_10.200.5.91.log:15823:S
19:11:53.158 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.5.91.log:15823:S
19:11:53.165 # Cluster state changed: fail redis_10.200.100.187.log:22016:S
19:11:53.165 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.187.log:22016:S
19:11:53.167 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.49.log:646447:S
19:11:53.168 # Cluster state changed: fail redis_10.200.100.49.log:646447:S
19:11:53.589 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.2.235.log:1200:M
19:11:53.590 # Cluster state changed: fail redis_10.200.2.235.log:1200:M
19:11:54.844 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.2.171.log:1200:S
19:11:54.844 # Cluster state changed: ok redis_10.200.2.171.log:1200:S
19:11:55.471 * FAIL message received from f24444696fb319fb6c85dd9e8e01d9b182852070 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.5.228.log:1205:M
19:11:55.514 # Cluster state changed: fail redis_10.200.5.228.log:1205:M
19:12:02.424 * Clear FAIL state for node a8f9373511a75b9c97c1a6ffe017c31ca8725a48: is reachable again and nobody is serving its slots after some time. redis_10.200.100.55.log:21922:S
19:12:03.212 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.101.179.log:2261:S
19:12:03.212 # Cluster state changed: ok redis_10.200.101.179.log:2261:S
19:12:03.218 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.3.31.log:15815:S
19:12:03.219 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.5.91.log:15823:S
19:12:03.219 # Cluster state changed: ok redis_10.200.3.31.log:15815:S
19:12:03.220 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.100.187.log:22016:S
19:12:03.220 # Cluster state changed: ok redis_10.200.100.187.log:22016:S
19:12:03.220 # Cluster state changed: ok redis_10.200.5.91.log:15823:S
19:12:03.223 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.100.55.log:21922:S
19:12:03.223 # Cluster state changed: ok redis_10.200.100.55.log:21922:S
19:12:04.637 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.100.99.log:34158:M
19:12:04.638 # Cluster state changed: ok redis_10.200.100.99.log:34158:M
19:12:04.639 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.100.72.log:22084:M
19:12:04.639 # Cluster state changed: ok redis_10.200.100.72.log:22084:M
19:12:04.641 * Clear FAIL state for node 5440860bf0c56106f22384be3bc7f74eccc318ca: is reachable again and nobody is serving its slots after some time. redis_10.200.2.235.log:1200:M
19:12:04.648 # Cluster state changed: ok redis_10.200.2.235.log:1200:M
19:12:04.752 # Cluster state changed: fail redis_10.200.2.171.log:1200:S
19:12:04.752 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.2.171.log:1200:S
19:12:04.754 * Marking node 5440860bf0c56106f22384be3bc7f74eccc318ca as failing (quorum reached). redis_10.200.100.194.log:10116:M
19:12:04.762 # Cluster state changed: fail redis_10.200.100.194.log:10116:M
19:12:04.769 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.5.91.log:15823:S
19:12:04.770 # Cluster state changed: fail redis_10.200.5.91.log:15823:S
19:12:04.817 # Start of election delayed for 754 milliseconds (rank #0, offset 2753806750215). redis_10.200.2.171.log:1200:S
19:12:04.870 # Cluster state changed: fail redis_10.200.101.179.log:2261:S
19:12:04.870 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.101.179.log:2261:S
19:12:04.875 # Cluster state changed: fail redis_10.200.100.55.log:21922:S
19:12:04.875 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.55.log:21922:S
19:12:04.880 # Cluster state changed: fail redis_10.200.100.187.log:22016:S
19:12:04.880 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.100.187.log:22016:S
19:12:04.881 # Cluster state changed: fail redis_10.200.3.31.log:15815:S
19:12:04.881 * FAIL message received from 93a628522165203a7ac85a35550879a66fe92142 about 5440860bf0c56106f22384be3bc7f74eccc318ca redis_10.200.3.31.log:15815:S
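In case it helps anyone slice the merged log above, here is a rough Python sketch that counts "Cluster state changed: fail" events per reporting node. The line shape it assumes (timestamp, level marker, message, then `redis_<ip>.log:<pid>:<role>`) is inferred from the excerpt only, and `tally_state_changes` is a hypothetical helper name, not part of any Redis tooling.

```python
import re
from collections import Counter

# Assumed line shape, inferred from the excerpt above:
#   HH:MM:SS.mmm <#|*> <message> redis_<ip>.log:<pid>:<role>
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) [#*] "
    r"(?P<msg>.*) redis_(?P<ip>[\d.]+)\.log:(?P<pid>\d+):(?P<role>[MS])$"
)


def tally_state_changes(lines):
    """Count 'Cluster state changed: fail' events per (ip, role) reporter."""
    fails = Counter()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("msg") == "Cluster state changed: fail":
            fails[(m.group("ip"), m.group("role"))] += 1
    return fails


# A few lines copied from the log excerpt above.
sample = [
    "19:10:37.612 # Cluster state changed: fail redis_10.200.100.85.log:21636:M",
    "19:10:37.802 # Cluster state changed: fail redis_10.200.5.228.log:1205:M",
    "19:10:42.613 # Cluster state changed: ok redis_10.200.100.85.log:21636:M",
    "19:10:50.734 # Cluster state changed: fail redis_10.200.100.85.log:21636:M",
    "19:11:41.823 * Marking node 5440860bf0c56106f22384be3bc7f74eccc318ca "
    "as failing (quorum reached). redis_10.200.2.171.log:1200:S",
]

counts = tally_state_changes(sample)
for (ip, role), n in counts.most_common():
    print(ip, role, n)
```

Running it over the full merged log should show which nodes flipped to the fail state most often, which may help narrow down whether the flapping is concentrated on a few hosts.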
We did not see high CPU or memory usage at that time.
Does anyone know of a known issue in this Redis version? We run several Redis clusters on later versions and do not see any issues there, even at much higher workloads.