We had a cluster running with 32 Redis nodes (16 masters, 16 slaves) split evenly across 4 machines. We then booted a new machine, started 8 nodes on it, and added them to the cluster as replicas. Up to this point everything was OK. Next, we promoted the new nodes to masters: for each one, we killed any other slaves connected to the same master, and then killed the master itself. We then restarted the nodes we had killed so that they would come back as slaves. At this point the cluster went down and we were unable to query it.
When we run cluster info and cluster nodes on a node on any of the 4 old machines, everything looks OK. Machine 10.97.153.90 is the new machine:
./redis-cli -p 6379 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:43
cluster_size:16
cluster_current_epoch:58
cluster_stats_messages_sent:2405700
cluster_stats_messages_received:237245
./redis-cli -p 6379 cluster nodes
358e88aaa180df3c14af70324a15b05499a6f81c 10.6.213.15:6379 master - 0 1403130264076 58 connected 3072-4095
22f38ad8afc3794ad43454d886abea198030fcfb 10.97.153.90:6380 master - 0 1403130264277 50 connected 1024-2047
ba27bda55a589f587f189988f7f26158147eb6df 10.104.123.245:6383 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130263472 50 connected
6c48fc68903672be03920f63d9929c25e700ff67 10.189.117.196:6384 master - 0 1403130262371 10 connected 9216-10239
14e2d217d3d3db1e28a6b5c0a4b85bbb49fc2204 10.189.121.71:6385 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130263676 54 connected
d24d256eaec99d0c7de1b50e7ac6988fa146950a 10.97.153.90:6383 master - 0 1403130262072 54 connected 12288-13311
ccef0a3b064069b291e25eb27d8303492c84fcd2 10.6.213.15:6380 master - 0 1403130263775 19 connected 7168-8191
52e3ea6e35a2c1466083f423167ce144d083ae31 10.104.123.245:6380 master - 0 1403130262270 43 connected 4096-5119
db7aeaac383544ff2c4380ad8c69a082501b6c2c 10.189.121.71:6380 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130262272 53 connected
3205d0677913b5eb2db6aa70799387f2a9ee278b 10.6.213.15:6381 slave d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 0 1403130263876 56 connected
d5e61834b27581515f725bab874d78da6788d92f 10.189.117.196:6379 master - 0 1403130263273 39 connected 2048-3071
b5051325dff7f454846ee174102b73eb343e746c 10.189.121.71:6379 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130264280 50 connected
ef0a9726b7249929e9f07712f8421d4462c5ac45 10.97.153.90:6379 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130263675 58 connected
98ec422a5176e51e2cf6b54af54def067f325331 10.189.121.71:6381 slave 6c48fc68903672be03920f63d9929c25e700ff67 0 1403130261872 10 connected
f4ecb51aab2326866f3118f9bb8e467c5803d5ec :0 master,fail,noaddr - 1403053880810 1403053880810 0 disconnected
80e8cda2b70a08651f7de5642f8a4b4260bbdb9a 10.104.123.245:6386 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130263272 53 connected
73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 10.97.153.90:6382 master - 0 1403130261972 53 connected 5120-6143
8a113f25261adc4f7efa68c06cbcd80720ccb18d 10.97.153.90:6384 master - 0 1403130262072 55 connected 0-1023
b90493db2bf13dc2bb7692c057bdb06f097355b8 10.104.123.245:6385 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130263372 58 connected
9f44db168f05455b6f3cbf6fee541142f4e6cef1 10.189.121.71:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130262171 42 connected
04860a24595e424be0be484812114b73967ce23e 10.189.121.71:6384 slave 52e3ea6e35a2c1466083f423167ce144d083ae31 0 1403130262572 43 connected
899c6e0419d857fa3ea16c49bada01404efbef99 10.104.123.245:6382 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130263372 54 connected
02c776b1fd86ac51e7db1112ef67ad591077b90a 10.189.117.196:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130262772 53 connected
808259e8b4034c134a3ebaee0412538520ef8a66 :0 master,fail,noaddr - 1403053880810 1403053880810 0 disconnected
4b60029245e5e1a500a3fb0181c06eb25864bf3c 10.6.213.15:6382 slave cae7cc5757c510707c1ba1854c634fe717cb8134 0 1403130262976 31 connected
6a39e0e52e9c2065d9246f490fd96acc9859f3b4 10.189.117.196:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130263476 57 connected
a0582038e8d5d8f88dd0b18d68eb77b8dc104b16 :0 myself,slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 0 44 connected
183b0438beb265ac7bdd0931f7788ed128745ef7 10.189.117.196:6386 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130261970 51 connected
778a223b3489be8d446f67e18238bdac3e66881c 10.104.123.245:6384 slave d5e61834b27581515f725bab874d78da6788d92f 0 1403130264276 39 connected
5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 10.6.213.15:6384 master - 0 1403130262272 42 connected 13312-14335
b1d312de0296fe69bca9888d39ba00bba512da4a 10.189.121.71:6383 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130263276 55 connected
e18501eda6f5b34f316caeb4675e177452d492b6 10.6.213.15:6383 slave bc4e0ed79b15f919523f227a27d8814d2dbf59db 0 1403130262572 15 connected
d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 10.97.153.90:6385 master - 0 1403130262573 56 connected 11264-12287
518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 10.97.153.90:6386 master - 0 1403130263174 57 connected 14336-15359
9a70061d0d0d9438efc8af2eb31c9fb95863f94c 10.189.117.196:6380 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130263676 51 connected
cae7cc5757c510707c1ba1854c634fe717cb8134 10.6.213.15:6386 master - 0 1403130263775 31 connected 15360-16383
2f874d0b518991a5562c2df50118ff3830d59a1b 10.104.123.245:6381 slave 161fffea420cf10c99a1abd1dcccb890ddf635aa 0 1403130263472 26 connected
bc4e0ed79b15f919523f227a27d8814d2dbf59db 10.189.117.196:6381 master - 0 1403130263576 15 connected 10240-11263
88be79fec7a3b12aa2e28e72a620ce78f684b36e :0 master,fail,noaddr - 1403053880810 1403053880810 0 disconnected
0da33215cac5a3429d9edc2087223839e30ed078 10.189.121.71:6386 slave ccef0a3b064069b291e25eb27d8303492c84fcd2 0 1403130262776 19 connected
161fffea420cf10c99a1abd1dcccb890ddf635aa 10.189.117.196:6383 master - 0 1403130262372 26 connected 8192-9215
97138d2812e259b89f36efb4be90cb8acb8e4820 10.6.213.15:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130263380 57 connected
0be86252ed21078e3d9109ffcbfbaf201402e71e 10.97.153.90:6381 master - 0 1403130264277 51 connected 6144-7167
On the other hand, when we run cluster info and cluster nodes on different nodes on the new machine, the results differ from node to node:
./redis-cli -p 6379 cluster info
cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:13312
cluster_slots_pfail:0
cluster_slots_fail:3072
cluster_known_nodes:40
cluster_size:16
cluster_current_epoch:58
cluster_stats_messages_sent:34694
cluster_stats_messages_received:34728
./redis-cli -p 6379 cluster nodes
358e88aaa180df3c14af70324a15b05499a6f81c 10.6.213.15:6379 master - 0 1403130030247 58 connected 3072-4095
ef0a9726b7249929e9f07712f8421d4462c5ac45 10.97.153.90:6379 myself,slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 0 49 connected
9f44db168f05455b6f3cbf6fee541142f4e6cef1 10.189.121.71:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130031250 42 connected
cae7cc5757c510707c1ba1854c634fe717cb8134 10.6.213.15:6386 master - 0 1403130029444 31 connected 15360-16383
518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 10.97.153.90:6386 master - 0 1403130031749 57 connected 14336-15359
db7aeaac383544ff2c4380ad8c69a082501b6c2c 10.189.121.71:6380 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130029947 53 connected
97138d2812e259b89f36efb4be90cb8acb8e4820 10.6.213.15:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130030948 57 connected
9a70061d0d0d9438efc8af2eb31c9fb95863f94c 10.189.117.196:6380 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130029745 51 connected
899c6e0419d857fa3ea16c49bada01404efbef99 10.104.123.245:6382 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130030247 54 connected
e18501eda6f5b34f316caeb4675e177452d492b6 10.6.213.15:6383 slave bc4e0ed79b15f919523f227a27d8814d2dbf59db 0 1403130030348 15 connected
0da33215cac5a3429d9edc2087223839e30ed078 10.189.121.71:6386 slave ccef0a3b064069b291e25eb27d8303492c84fcd2 0 1403130031750 19 connected
778a223b3489be8d446f67e18238bdac3e66881c 10.104.123.245:6384 slave d5e61834b27581515f725bab874d78da6788d92f 0 1403130030046 39 connected
d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 10.97.153.90:6385 master - 0 1403130030245 56 connected 11264-12287
ccef0a3b064069b291e25eb27d8303492c84fcd2 10.6.213.15:6380 master - 0 1403130031252 19 connected 7168-8191
6c48fc68903672be03920f63d9929c25e700ff67 10.189.117.196:6384 master - 0 1403130030247 10 connected 9216-10239
b90493db2bf13dc2bb7692c057bdb06f097355b8 10.104.123.245:6385 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130030747 58 connected
22f38ad8afc3794ad43454d886abea198030fcfb 10.97.153.90:6380 master,fail,noaddr - 1403128732610 1403128732610 50 disconnected 1024-2047
d5e61834b27581515f725bab874d78da6788d92f 10.189.117.196:6379 master - 0 1403130030247 39 connected 2048-3071
161fffea420cf10c99a1abd1dcccb890ddf635aa 10.189.117.196:6383 master - 0 1403130030747 26 connected 8192-9215
4b60029245e5e1a500a3fb0181c06eb25864bf3c 10.6.213.15:6382 slave cae7cc5757c510707c1ba1854c634fe717cb8134 0 1403130029648 31 connected
52e3ea6e35a2c1466083f423167ce144d083ae31 10.104.123.245:6380 master - 0 1403130031250 43 connected 4096-5119
2f874d0b518991a5562c2df50118ff3830d59a1b 10.104.123.245:6381 slave 161fffea420cf10c99a1abd1dcccb890ddf635aa 0 1403130030848 26 connected
a0582038e8d5d8f88dd0b18d68eb77b8dc104b16 10.104.123.245:6379 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130029748 55 connected
04860a24595e424be0be484812114b73967ce23e 10.189.121.71:6384 slave 52e3ea6e35a2c1466083f423167ce144d083ae31 0 1403130029847 43 connected
d24d256eaec99d0c7de1b50e7ac6988fa146950a 10.97.153.90:6383 master - 0 1403130030745 54 connected 12288-13311
183b0438beb265ac7bdd0931f7788ed128745ef7 10.189.117.196:6386 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130030747 51 connected
3205d0677913b5eb2db6aa70799387f2a9ee278b 10.6.213.15:6381 slave d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 0 1403130030247 56 connected
73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 10.97.153.90:6382 master,fail,noaddr - 1403128732610 1403128732610 53 disconnected 5120-6143
14e2d217d3d3db1e28a6b5c0a4b85bbb49fc2204 10.189.121.71:6385 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130029846 54 connected
b5051325dff7f454846ee174102b73eb343e746c 10.189.121.71:6379 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130031650 50 connected
6a39e0e52e9c2065d9246f490fd96acc9859f3b4 10.189.117.196:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130030949 57 connected
5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 10.6.213.15:6384 master - 0 1403130031049 42 connected 13312-14335
bc4e0ed79b15f919523f227a27d8814d2dbf59db 10.189.117.196:6381 master - 0 1403130029745 15 connected 10240-11263
b1d312de0296fe69bca9888d39ba00bba512da4a 10.189.121.71:6383 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130030748 55 connected
98ec422a5176e51e2cf6b54af54def067f325331 10.189.121.71:6381 slave 6c48fc68903672be03920f63d9929c25e700ff67 0 1403130031751 10 connected
80e8cda2b70a08651f7de5642f8a4b4260bbdb9a 10.104.123.245:6386 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130031650 53 connected
02c776b1fd86ac51e7db1112ef67ad591077b90a 10.189.117.196:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130031250 53 connected
ba27bda55a589f587f189988f7f26158147eb6df 10.104.123.245:6383 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130031758 50 connected
8a113f25261adc4f7efa68c06cbcd80720ccb18d 10.97.153.90:6384 master - 0 1403130031248 55 connected 0-1023
0be86252ed21078e3d9109ffcbfbaf201402e71e 10.97.153.90:6381 master,fail,noaddr - 1403128732610 1403128732610 51 disconnected 6144-7167
./redis-cli -c -p 6380 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:40
cluster_size:16
cluster_current_epoch:58
cluster_stats_messages_sent:2006300
cluster_stats_messages_received:2006020
./redis-cli -c -p 6380 cluster nodes
22f38ad8afc3794ad43454d886abea198030fcfb :0 myself,master - 0 0 50 connected 1024-2047
97138d2812e259b89f36efb4be90cb8acb8e4820 10.6.213.15:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130715378 57 connected
52e3ea6e35a2c1466083f423167ce144d083ae31 10.104.123.245:6380 master - 0 1403130716983 43 connected 4096-5119
6a39e0e52e9c2065d9246f490fd96acc9859f3b4 10.189.117.196:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130716783 57 connected
ccef0a3b064069b291e25eb27d8303492c84fcd2 10.6.213.15:6380 master - 0 1403130715180 19 connected 7168-8191
14e2d217d3d3db1e28a6b5c0a4b85bbb49fc2204 10.189.121.71:6385 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130716581 54 connected
02c776b1fd86ac51e7db1112ef67ad591077b90a 10.189.117.196:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130716983 53 connected
d5e61834b27581515f725bab874d78da6788d92f 10.189.117.196:6379 master - 0 1403130716380 39 connected 2048-3071
2f874d0b518991a5562c2df50118ff3830d59a1b 10.104.123.245:6381 slave 161fffea420cf10c99a1abd1dcccb890ddf635aa 0 1403130715379 26 connected
183b0438beb265ac7bdd0931f7788ed128745ef7 10.189.117.196:6386 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130716983 51 connected
9f44db168f05455b6f3cbf6fee541142f4e6cef1 10.189.121.71:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130716982 42 connected
899c6e0419d857fa3ea16c49bada01404efbef99 10.104.123.245:6382 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130716180 54 connected
d24d256eaec99d0c7de1b50e7ac6988fa146950a 10.97.153.90:6383 master - 0 1403130716479 54 connected 12288-13311
cae7cc5757c510707c1ba1854c634fe717cb8134 10.6.213.15:6386 master - 0 1403130715581 31 connected 15360-16383
db7aeaac383544ff2c4380ad8c69a082501b6c2c 10.189.121.71:6380 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130716180 53 connected
161fffea420cf10c99a1abd1dcccb890ddf635aa 10.189.117.196:6383 master - 0 1403130716380 26 connected 8192-9215
bc4e0ed79b15f919523f227a27d8814d2dbf59db 10.189.117.196:6381 master - 0 1403130714676 15 connected 10240-11263
d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 10.97.153.90:6385 master - 0 1403130715477 56 connected 11264-12287
73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 10.97.153.90:6382 master - 0 1403130715477 53 connected 5120-6143
ef0a9726b7249929e9f07712f8421d4462c5ac45 10.97.153.90:6379 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130716479 58 connected
b90493db2bf13dc2bb7692c057bdb06f097355b8 10.104.123.245:6385 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130716280 58 connected
04860a24595e424be0be484812114b73967ce23e 10.189.121.71:6384 slave 52e3ea6e35a2c1466083f423167ce144d083ae31 0 1403130715479 43 connected
98ec422a5176e51e2cf6b54af54def067f325331 10.189.121.71:6381 slave 6c48fc68903672be03920f63d9929c25e700ff67 0 1403130715481 10 connected
ba27bda55a589f587f189988f7f26158147eb6df 10.104.123.245:6383 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130716584 50 connected
6c48fc68903672be03920f63d9929c25e700ff67 10.189.117.196:6384 master - 0 1403130716281 10 connected 9216-10239
0da33215cac5a3429d9edc2087223839e30ed078 10.189.121.71:6386 slave ccef0a3b064069b291e25eb27d8303492c84fcd2 0 1403130717083 19 connected
5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 10.6.213.15:6384 master - 0 1403130714980 42 connected 13312-14335
0be86252ed21078e3d9109ffcbfbaf201402e71e 10.97.153.90:6381 master - 0 1403130715977 51 connected 6144-7167
4b60029245e5e1a500a3fb0181c06eb25864bf3c 10.6.213.15:6382 slave cae7cc5757c510707c1ba1854c634fe717cb8134 0 1403130716783 31 connected
e18501eda6f5b34f316caeb4675e177452d492b6 10.6.213.15:6383 slave bc4e0ed79b15f919523f227a27d8814d2dbf59db 0 1403130716581 15 connected
358e88aaa180df3c14af70324a15b05499a6f81c 10.6.213.15:6379 master - 0 1403130715378 58 connected 3072-4095
80e8cda2b70a08651f7de5642f8a4b4260bbdb9a 10.104.123.245:6386 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130714979 53 connected
b5051325dff7f454846ee174102b73eb343e746c 10.189.121.71:6379 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130714777 50 connected
9a70061d0d0d9438efc8af2eb31c9fb95863f94c 10.189.117.196:6380 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130716987 51 connected
b1d312de0296fe69bca9888d39ba00bba512da4a 10.189.121.71:6383 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130714977 55 connected
778a223b3489be8d446f67e18238bdac3e66881c 10.104.123.245:6384 slave d5e61834b27581515f725bab874d78da6788d92f 0 1403130715480 39 connected
3205d0677913b5eb2db6aa70799387f2a9ee278b 10.6.213.15:6381 slave d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 0 1403130715480 56 connected
a0582038e8d5d8f88dd0b18d68eb77b8dc104b16 10.104.123.245:6379 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130715078 55 connected
8a113f25261adc4f7efa68c06cbcd80720ccb18d 10.97.153.90:6384 master - 0 1403130714976 55 connected 0-1023
518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 10.97.153.90:6386 master - 0 1403130715977 57 connected 14336-15359
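To make the disagreement between views easier to spot, here is a minimal Python sketch (not something we ran on the cluster; the function names `parse_cluster_nodes` and `diff_views` are made up for illustration). It assumes the field order seen in the output above (node ID, address, flags, master ID, ping sent, pong received, config epoch, link state, then slots) and tolerates a node ID that has been wrapped onto its own line. The two sample strings are entries copied from the dumps above.

```python
import re

# A cluster node ID is 40 lowercase hex characters.
NODE_ID = re.compile(r"^[0-9a-f]{40}$")


def parse_cluster_nodes(text):
    """Return {node_id: {"addr", "flags", "slots"}} from `cluster nodes` text."""
    nodes = {}
    pending_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if NODE_ID.match(line):
            # The ID was wrapped onto its own line; the rest is on the next one.
            pending_id = line
            continue
        if pending_id is not None:
            line = pending_id + " " + line
            pending_id = None
        fields = line.split()
        # Skip anything that is not a node record (commands, cluster info lines).
        if len(fields) < 8 or not NODE_ID.match(fields[0]):
            continue
        nodes[fields[0]] = {
            "addr": fields[1],
            "flags": set(fields[2].split(",")),
            "slots": fields[8:],
        }
    return nodes


def diff_views(view_a, view_b):
    """List nodes whose presence or `fail` flag differs between two views."""
    differences = []
    for node_id in sorted(set(view_a) | set(view_b)):
        a, b = view_a.get(node_id), view_b.get(node_id)
        if a is None or b is None:
            differences.append((node_id, "missing in one view"))
        elif ("fail" in a["flags"]) != ("fail" in b["flags"]):
            differences.append((node_id, "fail flag differs"))
    return differences


# View from an old machine (two entries copied from the first dump).
old_view = parse_cluster_nodes(
    "22f38ad8afc3794ad43454d886abea198030fcfb 10.97.153.90:6380 master - 0 "
    "1403130264277 50 connected 1024-2047\n"
    "f4ecb51aab2326866f3118f9bb8e467c5803d5ec :0 master,fail,noaddr - "
    "1403053880810 1403053880810 0 disconnected\n"
)

# View from 10.97.153.90:6379 (one entry, with the ID wrapped onto its own line).
new_view = parse_cluster_nodes(
    "22f38ad8afc3794ad43454d886abea198030fcfb\n"
    "10.97.153.90:6380 master,fail,noaddr - 1403128732610 1403128732610 50 "
    "disconnected 1024-2047\n"
)

for node_id, reason in diff_views(old_view, new_view):
    print(node_id, reason)
```

In practice the two inputs would be the full `cluster nodes` output saved from two different nodes; the sketch only compares presence and the `fail` flag, not epochs or slot ownership.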
Does anyone have an idea why this happened and how we could fix it?
Thanks,
Phu