Inconsistent Redis Cluster State After Adding New Master Nodes


Phu Huynh

Jun 18, 2014, 6:33:30 PM
to redi...@googlegroups.com
Previously we had a cluster running with 32 Redis nodes (16 masters, 16 slaves) split evenly across 4 machines. We then booted up a new machine, started 8 nodes on it, and added them to the cluster as replicas. Up to this point everything was OK. Next, we promoted the new nodes to masters: for each one, we killed any other slaves connected to the same master, and then killed the master. We then restarted the nodes that we had killed so that they would come back as slaves. At this point the cluster went down and we were unable to query it.

When we run cluster info and cluster nodes on a node on any of the 4 old machines, everything looks OK. Machine 10.97.153.90 is the new machine:
./redis-cli -p 6379 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:43
cluster_size:16
cluster_current_epoch:58
cluster_stats_messages_sent:2405700
cluster_stats_messages_received:237245

./redis-cli -p 6379 cluster nodes
358e88aaa180df3c14af70324a15b05499a6f81c 10.6.213.15:6379 master - 0 1403130264076 58 connected 3072-4095
22f38ad8afc3794ad43454d886abea198030fcfb 10.97.153.90:6380 master - 0 1403130264277 50 connected 1024-2047
ba27bda55a589f587f189988f7f26158147eb6df 10.104.123.245:6383 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130263472 50 connected
6c48fc68903672be03920f63d9929c25e700ff67 10.189.117.196:6384 master - 0 1403130262371 10 connected 9216-10239
14e2d217d3d3db1e28a6b5c0a4b85bbb49fc2204 10.189.121.71:6385 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130263676 54 connected
d24d256eaec99d0c7de1b50e7ac6988fa146950a 10.97.153.90:6383 master - 0 1403130262072 54 connected 12288-13311
ccef0a3b064069b291e25eb27d8303492c84fcd2 10.6.213.15:6380 master - 0 1403130263775 19 connected 7168-8191
52e3ea6e35a2c1466083f423167ce144d083ae31 10.104.123.245:6380 master - 0 1403130262270 43 connected 4096-5119
db7aeaac383544ff2c4380ad8c69a082501b6c2c 10.189.121.71:6380 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130262272 53 connected
3205d0677913b5eb2db6aa70799387f2a9ee278b 10.6.213.15:6381 slave d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 0 1403130263876 56 connected
d5e61834b27581515f725bab874d78da6788d92f 10.189.117.196:6379 master - 0 1403130263273 39 connected 2048-3071
b5051325dff7f454846ee174102b73eb343e746c 10.189.121.71:6379 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130264280 50 connected
ef0a9726b7249929e9f07712f8421d4462c5ac45 10.97.153.90:6379 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130263675 58 connected
98ec422a5176e51e2cf6b54af54def067f325331 10.189.121.71:6381 slave 6c48fc68903672be03920f63d9929c25e700ff67 0 1403130261872 10 connected
f4ecb51aab2326866f3118f9bb8e467c5803d5ec :0 master,fail,noaddr - 1403053880810 1403053880810 0 disconnected
80e8cda2b70a08651f7de5642f8a4b4260bbdb9a 10.104.123.245:6386 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130263272 53 connected
73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 10.97.153.90:6382 master - 0 1403130261972 53 connected 5120-6143
8a113f25261adc4f7efa68c06cbcd80720ccb18d 10.97.153.90:6384 master - 0 1403130262072 55 connected 0-1023
b90493db2bf13dc2bb7692c057bdb06f097355b8 10.104.123.245:6385 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130263372 58 connected
9f44db168f05455b6f3cbf6fee541142f4e6cef1 10.189.121.71:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130262171 42 connected
04860a24595e424be0be484812114b73967ce23e 10.189.121.71:6384 slave 52e3ea6e35a2c1466083f423167ce144d083ae31 0 1403130262572 43 connected
899c6e0419d857fa3ea16c49bada01404efbef99 10.104.123.245:6382 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130263372 54 connected
02c776b1fd86ac51e7db1112ef67ad591077b90a 10.189.117.196:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130262772 53 connected
808259e8b4034c134a3ebaee0412538520ef8a66 :0 master,fail,noaddr - 1403053880810 1403053880810 0 disconnected
4b60029245e5e1a500a3fb0181c06eb25864bf3c 10.6.213.15:6382 slave cae7cc5757c510707c1ba1854c634fe717cb8134 0 1403130262976 31 connected
6a39e0e52e9c2065d9246f490fd96acc9859f3b4 10.189.117.196:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130263476 57 connected
a0582038e8d5d8f88dd0b18d68eb77b8dc104b16 :0 myself,slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 0 44 connected
183b0438beb265ac7bdd0931f7788ed128745ef7 10.189.117.196:6386 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130261970 51 connected
778a223b3489be8d446f67e18238bdac3e66881c 10.104.123.245:6384 slave d5e61834b27581515f725bab874d78da6788d92f 0 1403130264276 39 connected
5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 10.6.213.15:6384 master - 0 1403130262272 42 connected 13312-14335
b1d312de0296fe69bca9888d39ba00bba512da4a 10.189.121.71:6383 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130263276 55 connected
e18501eda6f5b34f316caeb4675e177452d492b6 10.6.213.15:6383 slave bc4e0ed79b15f919523f227a27d8814d2dbf59db 0 1403130262572 15 connected
d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 10.97.153.90:6385 master - 0 1403130262573 56 connected 11264-12287
518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 10.97.153.90:6386 master - 0 1403130263174 57 connected 14336-15359
9a70061d0d0d9438efc8af2eb31c9fb95863f94c 10.189.117.196:6380 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130263676 51 connected
cae7cc5757c510707c1ba1854c634fe717cb8134 10.6.213.15:6386 master - 0 1403130263775 31 connected 15360-16383
2f874d0b518991a5562c2df50118ff3830d59a1b 10.104.123.245:6381 slave 161fffea420cf10c99a1abd1dcccb890ddf635aa 0 1403130263472 26 connected
bc4e0ed79b15f919523f227a27d8814d2dbf59db 10.189.117.196:6381 master - 0 1403130263576 15 connected 10240-11263
88be79fec7a3b12aa2e28e72a620ce78f684b36e :0 master,fail,noaddr - 1403053880810 1403053880810 0 disconnected
0da33215cac5a3429d9edc2087223839e30ed078 10.189.121.71:6386 slave ccef0a3b064069b291e25eb27d8303492c84fcd2 0 1403130262776 19 connected
161fffea420cf10c99a1abd1dcccb890ddf635aa 10.189.117.196:6383 master - 0 1403130262372 26 connected 8192-9215
97138d2812e259b89f36efb4be90cb8acb8e4820 10.6.213.15:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130263380 57 connected
0be86252ed21078e3d9109ffcbfbaf201402e71e 10.97.153.90:6381 master - 0 1403130264277 51 connected 6144-7167

On the other hand, when we run cluster info and cluster nodes on different nodes on the new machine, the results differ:
./redis-cli -p 6379 cluster info
cluster_state:fail
cluster_slots_assigned:16384
cluster_slots_ok:13312
cluster_slots_pfail:0
cluster_slots_fail:3072
cluster_known_nodes:40
cluster_size:16
cluster_current_epoch:58
cluster_stats_messages_sent:34694
cluster_stats_messages_received:34728

./redis-cli -p 6379 cluster nodes
358e88aaa180df3c14af70324a15b05499a6f81c 10.6.213.15:6379 master - 0 1403130030247 58 connected 3072-4095
ef0a9726b7249929e9f07712f8421d4462c5ac45 10.97.153.90:6379 myself,slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 0 49 connected
9f44db168f05455b6f3cbf6fee541142f4e6cef1 10.189.121.71:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130031250 42 connected
cae7cc5757c510707c1ba1854c634fe717cb8134 10.6.213.15:6386 master - 0 1403130029444 31 connected 15360-16383
518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 10.97.153.90:6386 master - 0 1403130031749 57 connected 14336-15359
db7aeaac383544ff2c4380ad8c69a082501b6c2c 10.189.121.71:6380 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130029947 53 connected
97138d2812e259b89f36efb4be90cb8acb8e4820 10.6.213.15:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130030948 57 connected
9a70061d0d0d9438efc8af2eb31c9fb95863f94c 10.189.117.196:6380 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130029745 51 connected
899c6e0419d857fa3ea16c49bada01404efbef99 10.104.123.245:6382 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130030247 54 connected
e18501eda6f5b34f316caeb4675e177452d492b6 10.6.213.15:6383 slave bc4e0ed79b15f919523f227a27d8814d2dbf59db 0 1403130030348 15 connected
0da33215cac5a3429d9edc2087223839e30ed078 10.189.121.71:6386 slave ccef0a3b064069b291e25eb27d8303492c84fcd2 0 1403130031750 19 connected
778a223b3489be8d446f67e18238bdac3e66881c 10.104.123.245:6384 slave d5e61834b27581515f725bab874d78da6788d92f 0 1403130030046 39 connected
d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 10.97.153.90:6385 master - 0 1403130030245 56 connected 11264-12287
ccef0a3b064069b291e25eb27d8303492c84fcd2 10.6.213.15:6380 master - 0 1403130031252 19 connected 7168-8191
6c48fc68903672be03920f63d9929c25e700ff67 10.189.117.196:6384 master - 0 1403130030247 10 connected 9216-10239
b90493db2bf13dc2bb7692c057bdb06f097355b8 10.104.123.245:6385 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130030747 58 connected
22f38ad8afc3794ad43454d886abea198030fcfb 10.97.153.90:6380 master,fail,noaddr - 1403128732610 1403128732610 50 disconnected 1024-2047
d5e61834b27581515f725bab874d78da6788d92f 10.189.117.196:6379 master - 0 1403130030247 39 connected 2048-3071
161fffea420cf10c99a1abd1dcccb890ddf635aa 10.189.117.196:6383 master - 0 1403130030747 26 connected 8192-9215
4b60029245e5e1a500a3fb0181c06eb25864bf3c 10.6.213.15:6382 slave cae7cc5757c510707c1ba1854c634fe717cb8134 0 1403130029648 31 connected
52e3ea6e35a2c1466083f423167ce144d083ae31 10.104.123.245:6380 master - 0 1403130031250 43 connected 4096-5119
2f874d0b518991a5562c2df50118ff3830d59a1b 10.104.123.245:6381 slave 161fffea420cf10c99a1abd1dcccb890ddf635aa 0 1403130030848 26 connected
a0582038e8d5d8f88dd0b18d68eb77b8dc104b16 10.104.123.245:6379 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130029748 55 connected
04860a24595e424be0be484812114b73967ce23e 10.189.121.71:6384 slave 52e3ea6e35a2c1466083f423167ce144d083ae31 0 1403130029847 43 connected
d24d256eaec99d0c7de1b50e7ac6988fa146950a 10.97.153.90:6383 master - 0 1403130030745 54 connected 12288-13311
183b0438beb265ac7bdd0931f7788ed128745ef7 10.189.117.196:6386 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130030747 51 connected
3205d0677913b5eb2db6aa70799387f2a9ee278b 10.6.213.15:6381 slave d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 0 1403130030247 56 connected
73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 10.97.153.90:6382 master,fail,noaddr - 1403128732610 1403128732610 53 disconnected 5120-6143
14e2d217d3d3db1e28a6b5c0a4b85bbb49fc2204 10.189.121.71:6385 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130029846 54 connected
b5051325dff7f454846ee174102b73eb343e746c 10.189.121.71:6379 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130031650 50 connected
6a39e0e52e9c2065d9246f490fd96acc9859f3b4 10.189.117.196:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130030949 57 connected
5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 10.6.213.15:6384 master - 0 1403130031049 42 connected 13312-14335
bc4e0ed79b15f919523f227a27d8814d2dbf59db 10.189.117.196:6381 master - 0 1403130029745 15 connected 10240-11263
b1d312de0296fe69bca9888d39ba00bba512da4a 10.189.121.71:6383 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130030748 55 connected
98ec422a5176e51e2cf6b54af54def067f325331 10.189.121.71:6381 slave 6c48fc68903672be03920f63d9929c25e700ff67 0 1403130031751 10 connected
80e8cda2b70a08651f7de5642f8a4b4260bbdb9a 10.104.123.245:6386 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130031650 53 connected
02c776b1fd86ac51e7db1112ef67ad591077b90a 10.189.117.196:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130031250 53 connected
ba27bda55a589f587f189988f7f26158147eb6df 10.104.123.245:6383 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130031758 50 connected
8a113f25261adc4f7efa68c06cbcd80720ccb18d 10.97.153.90:6384 master - 0 1403130031248 55 connected 0-1023
0be86252ed21078e3d9109ffcbfbaf201402e71e 10.97.153.90:6381 master,fail,noaddr - 1403128732610 1403128732610 51 disconnected 6144-7167

./redis-cli -c -p 6380 cluster info
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384
cluster_slots_pfail:0
cluster_slots_fail:0
cluster_known_nodes:40
cluster_size:16
cluster_current_epoch:58
cluster_stats_messages_sent:2006300
cluster_stats_messages_received:2006020

./redis-cli -c -p 6380 cluster nodes
22f38ad8afc3794ad43454d886abea198030fcfb :0 myself,master - 0 0 50 connected 1024-2047
97138d2812e259b89f36efb4be90cb8acb8e4820 10.6.213.15:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130715378 57 connected
52e3ea6e35a2c1466083f423167ce144d083ae31 10.104.123.245:6380 master - 0 1403130716983 43 connected 4096-5119
6a39e0e52e9c2065d9246f490fd96acc9859f3b4 10.189.117.196:6385 slave 518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 0 1403130716783 57 connected
ccef0a3b064069b291e25eb27d8303492c84fcd2 10.6.213.15:6380 master - 0 1403130715180 19 connected 7168-8191
14e2d217d3d3db1e28a6b5c0a4b85bbb49fc2204 10.189.121.71:6385 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130716581 54 connected
02c776b1fd86ac51e7db1112ef67ad591077b90a 10.189.117.196:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130716983 53 connected
d5e61834b27581515f725bab874d78da6788d92f 10.189.117.196:6379 master - 0 1403130716380 39 connected 2048-3071
2f874d0b518991a5562c2df50118ff3830d59a1b 10.104.123.245:6381 slave 161fffea420cf10c99a1abd1dcccb890ddf635aa 0 1403130715379 26 connected
183b0438beb265ac7bdd0931f7788ed128745ef7 10.189.117.196:6386 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130716983 51 connected
9f44db168f05455b6f3cbf6fee541142f4e6cef1 10.189.121.71:6382 slave 5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 0 1403130716982 42 connected
899c6e0419d857fa3ea16c49bada01404efbef99 10.104.123.245:6382 slave d24d256eaec99d0c7de1b50e7ac6988fa146950a 0 1403130716180 54 connected
d24d256eaec99d0c7de1b50e7ac6988fa146950a 10.97.153.90:6383 master - 0 1403130716479 54 connected 12288-13311
cae7cc5757c510707c1ba1854c634fe717cb8134 10.6.213.15:6386 master - 0 1403130715581 31 connected 15360-16383
db7aeaac383544ff2c4380ad8c69a082501b6c2c 10.189.121.71:6380 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130716180 53 connected
161fffea420cf10c99a1abd1dcccb890ddf635aa 10.189.117.196:6383 master - 0 1403130716380 26 connected 8192-9215
bc4e0ed79b15f919523f227a27d8814d2dbf59db 10.189.117.196:6381 master - 0 1403130714676 15 connected 10240-11263
d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 10.97.153.90:6385 master - 0 1403130715477 56 connected 11264-12287
73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 10.97.153.90:6382 master - 0 1403130715477 53 connected 5120-6143
ef0a9726b7249929e9f07712f8421d4462c5ac45 10.97.153.90:6379 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130716479 58 connected
b90493db2bf13dc2bb7692c057bdb06f097355b8 10.104.123.245:6385 slave 358e88aaa180df3c14af70324a15b05499a6f81c 0 1403130716280 58 connected
04860a24595e424be0be484812114b73967ce23e 10.189.121.71:6384 slave 52e3ea6e35a2c1466083f423167ce144d083ae31 0 1403130715479 43 connected
98ec422a5176e51e2cf6b54af54def067f325331 10.189.121.71:6381 slave 6c48fc68903672be03920f63d9929c25e700ff67 0 1403130715481 10 connected
ba27bda55a589f587f189988f7f26158147eb6df 10.104.123.245:6383 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130716584 50 connected
6c48fc68903672be03920f63d9929c25e700ff67 10.189.117.196:6384 master - 0 1403130716281 10 connected 9216-10239
0da33215cac5a3429d9edc2087223839e30ed078 10.189.121.71:6386 slave ccef0a3b064069b291e25eb27d8303492c84fcd2 0 1403130717083 19 connected
5f17d0a9ae3ea43a1f9681c6f5002dff4b732fd0 10.6.213.15:6384 master - 0 1403130714980 42 connected 13312-14335
0be86252ed21078e3d9109ffcbfbaf201402e71e 10.97.153.90:6381 master - 0 1403130715977 51 connected 6144-7167
4b60029245e5e1a500a3fb0181c06eb25864bf3c 10.6.213.15:6382 slave cae7cc5757c510707c1ba1854c634fe717cb8134 0 1403130716783 31 connected
e18501eda6f5b34f316caeb4675e177452d492b6 10.6.213.15:6383 slave bc4e0ed79b15f919523f227a27d8814d2dbf59db 0 1403130716581 15 connected
358e88aaa180df3c14af70324a15b05499a6f81c 10.6.213.15:6379 master - 0 1403130715378 58 connected 3072-4095
80e8cda2b70a08651f7de5642f8a4b4260bbdb9a 10.104.123.245:6386 slave 73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 0 1403130714979 53 connected
b5051325dff7f454846ee174102b73eb343e746c 10.189.121.71:6379 slave 22f38ad8afc3794ad43454d886abea198030fcfb 0 1403130714777 50 connected
9a70061d0d0d9438efc8af2eb31c9fb95863f94c 10.189.117.196:6380 slave 0be86252ed21078e3d9109ffcbfbaf201402e71e 0 1403130716987 51 connected
b1d312de0296fe69bca9888d39ba00bba512da4a 10.189.121.71:6383 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130714977 55 connected
778a223b3489be8d446f67e18238bdac3e66881c 10.104.123.245:6384 slave d5e61834b27581515f725bab874d78da6788d92f 0 1403130715480 39 connected
3205d0677913b5eb2db6aa70799387f2a9ee278b 10.6.213.15:6381 slave d75e5ad9d0acbd2586befc7bae0039d9cc789ecd 0 1403130715480 56 connected
a0582038e8d5d8f88dd0b18d68eb77b8dc104b16 10.104.123.245:6379 slave 8a113f25261adc4f7efa68c06cbcd80720ccb18d 0 1403130715078 55 connected
8a113f25261adc4f7efa68c06cbcd80720ccb18d 10.97.153.90:6384 master - 0 1403130714976 55 connected 0-1023
518a6c613d9ac8ec72ae121a0bf1c7ead6a2bdc1 10.97.153.90:6386 master - 0 1403130715977 57 connected 14336-15359
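
For anyone comparing the listings above, here is a minimal Python sketch (not part of redis-cli; the function names are made up for illustration) that parses cluster nodes output and adds up the slots owned by nodes flagged fail. It assumes the field order shown in the listings: id, address, flags, master, ping sent, pong received, epoch, link state, slots. The sample lines are copied from the port 6379 view on the new machine.

```python
def parse_cluster_nodes(text):
    """Parse `cluster nodes` output into a list of dicts, one per node."""
    nodes = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue  # not a node line
        nodes.append({
            "id": parts[0],
            "addr": parts[1],
            "flags": set(parts[2].split(",")),
            "master": parts[3],
            "epoch": int(parts[6]),
            "link": parts[7],
            "slots": parts[8:],  # empty for slaves
        })
    return nodes


def slot_count(ranges):
    """Count slots in entries such as '1024-2047' or a single slot '5'."""
    total = 0
    for r in ranges:
        if r.startswith("["):
            continue  # skip migrating/importing markers
        lo, _, hi = r.partition("-")
        total += int(hi or lo) - int(lo) + 1
    return total


def failed_slots(nodes):
    """Sum the slots owned by nodes whose flags include 'fail'."""
    return sum(slot_count(n["slots"]) for n in nodes if "fail" in n["flags"])


SAMPLE = """\
22f38ad8afc3794ad43454d886abea198030fcfb 10.97.153.90:6380 master,fail,noaddr - 1403128732610 1403128732610 50 disconnected 1024-2047
73f41bbdb091b5f6cfe51f48e9b115b0410f01ef 10.97.153.90:6382 master,fail,noaddr - 1403128732610 1403128732610 53 disconnected 5120-6143
0be86252ed21078e3d9109ffcbfbaf201402e71e 10.97.153.90:6381 master,fail,noaddr - 1403128732610 1403128732610 51 disconnected 6144-7167
8a113f25261adc4f7efa68c06cbcd80720ccb18d 10.97.153.90:6384 master - 0 1403130031248 55 connected 0-1023
"""

nodes = parse_cluster_nodes(SAMPLE)
print(failed_slots(nodes))  # 3072
```

The three masters that this node sees as fail,noaddr own 1024 slots each, which accounts for the cluster_slots_fail:3072 it reports.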

Do you have any idea why this happened and how we could fix it?

Thanks,
Phu

Salvatore Sanfilippo

Jun 19, 2014, 12:50:18 AM
to Redis DB
Hello, I'll check more carefully today. Just one information-gathering
question: what version of Redis are you running? (There was a bug
before the last beta.)

--
Salvatore 'antirez' Sanfilippo
open source developer - GoPivotal
http://invece.org

"One would never undertake such a thing if one were not driven on by
some demon whom one can neither resist nor understand."
— George Orwell

Srihari Venkatesan

Jun 19, 2014, 2:49:06 AM
to redi...@googlegroups.com
Hi Salvatore,

I work with Phu. We upgraded to Redis 3.0.0 Beta 6 a few days before we added the new box.

Thanks,
-Hari

Salvatore Sanfilippo

Jun 19, 2014, 3:10:41 AM
to Redis DB
Great, thank you, this makes it simpler to reason about the issue. News ASAP.

Salvatore Sanfilippo

Jun 20, 2014, 3:40:44 AM
to Redis DB
Hello,

I was finally able to analyze the data. Basically, this seems to be what happened:

1) At some point during your operations you switched the IP address of
the new server, so all the new nodes were flagged NOADDR. Note that
this hypothesis seems supported by the fact that there is only one
place in cluster.c that flags a node NOADDR.
2) After the IP address switch, Redis Cluster is supposed to recover
and find the new address of the node. This does happen: the ip/port
in your listing is updated to what appears to be the right one.
3) However, because of a bug in Redis Cluster, the NOADDR flag is not
cleared when the address is updated, so the cluster never tries to
reconnect to the nodes. The nodes therefore remain disconnected and
in FAIL state.
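
As a rough illustration only, the three steps can be modelled in a few lines of Python. This is a toy model, not the code in cluster.c: the class, the function names, and the clear_noaddr switch are all hypothetical and exist only to show why a stale NOADDR flag keeps a node disconnected.

```python
NOADDR = "noaddr"
FAIL = "fail"


class Node:
    """Toy stand-in for how one node sees a peer."""

    def __init__(self, addr):
        self.addr = addr
        self.flags = set()
        self.connected = True


def mark_address_unknown(node):
    """Step 1: the peer's address is no longer known."""
    node.addr = None
    node.flags.update({NOADDR, FAIL})
    node.connected = False


def update_address(node, new_addr, clear_noaddr):
    """Step 2: the address is learned again.

    clear_noaddr=False models the behaviour described in step 3,
    clear_noaddr=True models the behaviour after the fix.
    """
    node.addr = new_addr
    if clear_noaddr:
        node.flags.discard(NOADDR)


def try_reconnect(node):
    """Peers still flagged NOADDR are skipped, so they never reconnect."""
    if NOADDR in node.flags:
        return False
    node.connected = True
    node.flags.discard(FAIL)
    return True


def simulate(clear_noaddr):
    node = Node("10.97.153.90:6380")
    mark_address_unknown(node)
    update_address(node, "10.97.153.90:6380", clear_noaddr)
    return try_reconnect(node)


print(simulate(clear_noaddr=False))  # False: address is right, flag is stale
print(simulate(clear_noaddr=True))   # True: flag cleared, reconnect proceeds
```

In the first case the node ends up with a correct ip/port but still carries fail,noaddr, which matches what the listing from the new machine shows.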

If you cherry-pick the following commit from unstable, this issue
should be fixed:

22d17bc Cluster: clear NOADDR flag when updating node address.

However, AFAIK you need to update all the nodes and restart them.

Thanks for your help. I'm interested in continuing the debugging to
see whether this fixes the issue, and in whether you remember actually
having changed the address.

A note about "1" in the chain of events: an IP switch is not strictly
needed. The effect is exactly the same if you move instances from one
machine to another by copying files.

Salvatore

Srihari Venkatesan

Jun 20, 2014, 3:01:58 PM
to redi...@googlegroups.com, brian.c...@xad.com
Thanks for the explanation, Salvatore. We did not change the IP address of the box, but we did a few other things, listed below. We are not sure whether they could have caused the issue:

1. We initially did not open up the ACLs for the 1xxxx ports. This caused redis-trib to hang when we tried to add the nodes to the cluster. I am not sure whether this could have caused the NOADDR flag.
2. After realizing this, we stopped the Redis nodes on the new box.
3. We fixed the ACL.
4. We wiped the rdb and aof files and restarted the Redis nodes.
5. This time redis-trib successfully added them to the cluster.

Can the above sequence of events lead to this state?
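
For reference on the ports in step 1: each Redis Cluster node also listens on a cluster bus port, which is its client port plus 10000 (so 6379 pairs with 16379). Below is a minimal Python sketch of a check that could be run from another machine before adding nodes. It assumes that a plain TCP connect is a good enough proxy for the ACL being open, and the helper names are hypothetical.

```python
import socket

BUS_PORT_OFFSET = 10000  # cluster bus port = client port + 10000


def bus_port(client_port):
    """Return the cluster bus port that pairs with a client port."""
    return client_port + BUS_PORT_OFFSET


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_bus_ports(host, client_ports, timeout=1.0):
    """List the bus ports on host that cannot be reached."""
    return [
        bus_port(p)
        for p in client_ports
        if not is_reachable(host, bus_port(p), timeout)
    ]


# Example (not run here): the 8 nodes on the new box use ports 6379-6386.
# blocked = unreachable_bus_ports("10.97.153.90", range(6379, 6387))
# An empty list would mean every bus port accepted a connection.
```

A non-empty result before running redis-trib would point at the ACL rather than at Redis itself.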