Hi there. We have a Redis cluster with 3 masters and 3 slaves. Redis is accessed from our web servers and our long-running job servers (among other servers). We want to restart our server nodes, but every time we have done so we have suffered latency spikes (around 2-3 minutes each). Each node holds about 50GB of data. We use the phpredis C extension to connect to the cluster, and we were expecting clients to be redirected accordingly.
So, this is what we ran:
1.- On the slave node, run "redis-cli shutdown" (this causes a latency spike). Right after, the slave shows as fail in the "redis-cli cluster nodes" output from another node:
aaaaaac548698c033267c74b6099e4a434971fbb 192.0.0.2:6379 myself,slave aaaaaa4dd75553c3b6735b9c6060d9c511db0519 0 0 44 connected
aaaaaaa2d4a5c0161a0e1ae4dbba021a9d8a9761 192.0.0.3:6379 slave,fail aaaaaa850b22df65413e1524ce481412d87c6b7c 1486099567186 1486099565185 50 disconnected
aaaaaa163054dd43c744cdf7e58dc56d9e49e1e5 192.0.0.4:6379 master - 0 1486099621277 49 connected 10923-16383
aaaaaa850b22df65413e1524ce481412d87c6b7c 192.0.0.5:6379 master - 0 1486099620777 50 connected 0-5460
aaaaaab8d3f60987d09365276e52414511308476 192.0.0.6:6379 slave aaaaaa163054dd43c744cdf7e58dc56d9e49e1e5 0 1486099619777 49 connected
aaaaaa4dd75553c3b6735b9c6060d9c511db0519 192.0.0.7:6379 master - 0 1486099620277 48 connected 5461-10922
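For reference, here is a minimal Python sketch of how the failed node could be picked out of this kind of output. The function names `parse_cluster_nodes` and `failed_nodes` are hypothetical, and the sketch assumes each node is reported on a single whitespace-separated line (ID, address, flags, master ID, ping sent, pong received, config epoch, link state, then slots), as in the output above.

```python
def parse_cluster_nodes(text):
    """Parse `redis-cli cluster nodes` output into a list of dicts."""
    nodes = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue  # skip blank or truncated lines
        nodes.append({
            "id": fields[0],
            "addr": fields[1],
            "flags": fields[2].split(","),
            "master_id": fields[3],
            "link_state": fields[7],
            "slots": fields[8:],  # empty for slaves
        })
    return nodes


def failed_nodes(nodes):
    """Return addresses of nodes flagged fail (or fail?, i.e. suspected)."""
    return [n["addr"] for n in nodes
            if "fail" in n["flags"] or "fail?" in n["flags"]]


# Two lines taken from the output above.
SAMPLE = """\
aaaaaaa2d4a5c0161a0e1ae4dbba021a9d8a9761 192.0.0.3:6379 slave,fail aaaaaa850b22df65413e1524ce481412d87c6b7c 1486099567186 1486099565185 50 disconnected
aaaaaa850b22df65413e1524ce481412d87c6b7c 192.0.0.5:6379 master - 0 1486099620777 50 connected 0-5460
"""

print(failed_nodes(parse_cluster_nodes(SAMPLE)))
```

The same check, polled until the list comes back empty, would be one way to decide when the rebooted node has rejoined.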
2.- Reboot the slave server and wait for the node to rejoin the cluster. This causes a second latency spike once the slave receives the dataset from the master and restarts itself to load the dataset into memory. This is the output of "redis-cli cluster nodes" from the rebooted server (slave):
aaaaaac548698c033267c74b6099e4a434971fbb 192.0.0.2:6379 slave aaaaaa4dd75553c3b6735b9c6060d9c511db0519 0 0 48 connected
aaaaaaa2d4a5c0161a0e1ae4dbba021a9d8a9761 192.0.0.3:6379 myself,slave aaaaaa850b22df65413e1524ce481412d87c6b7c 0 0 47 connected
aaaaaa163054dd43c744cdf7e58dc56d9e49e1e5 192.0.0.4:6379 master - 0 1486100401395 49 connected 10923-16383
aaaaaa850b22df65413e1524ce481412d87c6b7c 192.0.0.5:6379 master - 0 1486100399863 50 connected 0-5460
aaaaaab8d3f60987d09365276e52414511308476 192.0.0.6:6379 slave aaaaaa163054dd43c744cdf7e58dc56d9e49e1e5 0 1486100399352 49 connected
aaaaaa4dd75553c3b6735b9c6060d9c511db0519 192.0.0.7:6379 master - 0 1486100400373 48 connected 5461-10922
3.- Trigger a failover so the slave becomes the master: from the slave, run "redis-cli cluster failover" (this causes another latency spike). The old master then becomes a slave, restarts, and loads the dataset into memory.
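To make the sequence easier to reproduce, the three steps can be written out as a small Python sketch that only prints the commands rather than running them. The helper names `redis_cli` and `restart_plan` are hypothetical, the port 6379 and the host 192.0.0.3 are taken from the output above, and the OS reboot between steps 1 and 2 is deliberately left out because it happens outside redis-cli.

```python
def redis_cli(host, *args, port=6379):
    """Build a redis-cli command line (argv list) for the given node."""
    return ["redis-cli", "-h", host, "-p", str(port), *args]


def restart_plan(slave_host):
    """Return the (description, argv) steps described in this post."""
    return [
        ("1. shut down the slave",
         redis_cli(slave_host, "shutdown")),
        ("2. after the reboot, check that the slave has rejoined",
         redis_cli(slave_host, "cluster", "nodes")),
        ("3. promote the slave to master",
         redis_cli(slave_host, "cluster", "failover")),
    ]


# Print the plan only; nothing is executed against the cluster.
for description, argv in restart_plan("192.0.0.3"):
    print(f"# {description}")
    print(" ".join(argv))
```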
So, it took us 3 latency spikes just to get one server restarted and the master/slave roles switched. Is there a better way to restart Redis nodes with no downtime? Maybe by tweaking some settings? Any help will be appreciated. Thanks!
Jorge