16327:M 23 Oct 04:39:46.012 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.16327:M 23 Oct 04:39:53.845 # Connection with slave 10.10.21.109:6379 lost.16327:M 23 Oct 04:39:53.845 # Connection with slave 10.10.30.129:6379 lost.16327:M 23 Oct 04:39:53.869 # CONFIG REWRITE executed with success.16327:M 23 Oct 04:39:53.943 * 1 changes in 10800 seconds. Saving...16327:M 23 Oct 04:39:53.944 * Background saving started by pid 2965029650:C 23 Oct 04:39:54.008 * DB saved on disk29650:C 23 Oct 04:39:54.008 * RDB: 0 MB of memory used by copy-on-write16327:M 23 Oct 04:39:54.044 * Background saving terminated with success16327:S 23 Oct 04:40:03.869 * SLAVE OF 10.10.21.109:6379 enabled (user request from 'id=432 addr=10.10.20.161:40138 fd=72 name=sentinel-ba425363-cmd age=17 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=rw cmd=exec')16327:S 23 Oct 04:40:03.870 # CONFIG REWRITE executed with success.16327:S 23 Oct 04:40:04.356 * Connecting to MASTER 10.10.21.109:637916327:S 23 Oct 04:40:04.375 * MASTER <-> SLAVE sync started16327:S 23 Oct 04:40:04.375 * Non blocking connect for SYNC fired the event.16327:S 23 Oct 04:40:04.376 * Master replied to PING, replication can continue...16327:S 23 Oct 04:40:04.376 * Partial resynchronization not possible (no cached master)16327:S 23 Oct 04:40:04.377 * Full resync from master: 83703af33e7f4783591e2fa7e6ccd950db5cdd10:19792172616327:S 23 Oct 04:40:04.404 * MASTER <-> SLAVE sync: receiving 245028 bytes from master16327:S 23 Oct 04:40:04.406 * MASTER <-> SLAVE sync: Flushing old data16327:S 23 Oct 04:40:04.407 * MASTER <-> SLAVE sync: Loading DB in memory16327:S 23 Oct 04:40:04.410 * MASTER <-> SLAVE sync: Finished with success16327:S 23 Oct 04:40:04.425 * Background append only file rewriting started by pid 2967116327:S 23 Oct 04:40:04.464 * AOF rewrite child asks to stop sending diffs.29671:C 23 Oct 04:40:04.464 * Parent agreed to stop sending diffs. Finalizing AOF...29671:C 23 Oct 04:40:04.464 * Concatenating 0.00 MB of AOF diff received from parent.29671:C 23 Oct 04:40:04.464 * SYNC append only file rewrite performed29671:C 23 Oct 04:40:04.464 * AOF rewrite: 0 MB of memory used by copy-on-write16327:S 23 Oct 04:40:04.475 * Background AOF rewrite terminated with success16327:S 23 Oct 04:40:04.475 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)16327:S 23 Oct 04:40:04.475 * Background AOF rewrite finished successfully
17755:M 23 Oct 04:39:52.174 # Connection with master lost.
16335:S 23 Oct 04:39:53.017 # Connection with master lost.16335:S 23 Oct 04:39:53.018 * Caching the disconnected master state.16335:S 23 Oct 04:39:53.018 * Discarding previously cached master state.16335:S 23 Oct 04:39:53.023 * SLAVE OF 10.10.21.109:6379 enabled (user request from 'id=5 addr=10.10.30.129:36444 fd=10 name=sentinel-21af3b90-cmd age=494042 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=14 qbuf-free=32754 obl=36 oll=0 omem=0 events=rw cmd=exec')16335:S 23 Oct 04:39:53.049 # CONFIG REWRITE executed with success.16335:S 23 Oct 04:39:53.311 # CONFIG REWRITE executed with success.16335:S 23 Oct 04:39:53.584 * Connecting to MASTER 10.10.21.109:637916335:S 23 Oct 04:39:53.584 * MASTER <-> SLAVE sync started16335:S 23 Oct 04:39:53.586 * Non blocking connect for SYNC fired the event.16335:S 23 Oct 04:39:53.587 * Master replied to PING, replication can continue...16335:S 23 Oct 04:39:53.590 * Partial resynchronization not possible (no cached master)16335:S 23 Oct 04:39:53.591 * Full resync from master: 83703af33e7f4783591e2fa7e6ccd950db5cdd10:19791867516335:S 23 Oct 04:39:53.690 * MASTER <-> SLAVE sync: receiving 245124 bytes from master16335:S 23 Oct 04:39:53.694 * MASTER <-> SLAVE sync: Flushing old data16335:S 23 Oct 04:39:53.696 * MASTER <-> SLAVE sync: Loading DB in memory
pidfile "/var/run/redis/redis.pid"port 6379tcp-backlog 65535unixsocket "/tmp/redis.sock"timeout 0tcp-keepalive 0loglevel noticelogfile "/var/log/apps/redis.log"syslog-enabled nosyslog-ident "redis"syslog-facility local0databases 16save 86400 1000
stop-writes-on-bgsave-error yesrdbcompression yesrdbchecksum yesdbfilename "dump.rdb"
dir "/apps/redis"slave-serve-stale-data yesslave-read-only yesrepl-ping-slave-period 10repl-timeout 60repl-disable-tcp-nodelay noslave-priority 100maxclients 10000maxmemory 64012kbappendonly yesappendfilename "appendonly.aof"appendfsync everysecno-appendfsync-on-rewrite yesauto-aof-rewrite-percentage 100auto-aof-rewrite-min-size 64mblua-time-limit 5000slowlog-log-slower-than 10000slowlog-max-len 128notify-keyspace-events ""
16335:S 23 Oct 04:39:53.023 * SLAVE OF 10.10.21.109:6379 enabled (user request from 'id=5 addr=10.10.30.129:36444
fd=10 name=sentinel-21af3b90-cmd age=494042 idle=0 flags=x db=0 sub=0
psub=0 multi=3 qbuf=14 qbuf-free=32754 obl=36 oll=0 omem=0 events=rw
cmd=exec')
Note the "name=sentinel" portion, part of the explanation of where the SLAVEOF command came from. This indicates you're running Sentinel to control your Redis master and slaves. Sentinel's function is to detect a master that has become unavailable and change one of the slaves to be the new master (and when the old master is available, change it to be a slave of the new master).
So the action of master changing to a different server is what Sentinel is designed to do, and your logs indicate your Sentinel is doing it. The reasons why Sentinel is changing the master should be found in the Sentinel logs. That's your best lead to the cause of the problem, and from there, the solution.
Your server performance graphing/monitoring will be able to confirm or deny that your Redis or Sentinel servers are experiencing slowness or brief freeze-ups. You'll have to add some monitoring to find out if the network between the Sentinels and Redis servers gets clogged or has hiccups.
redis16 - 10.10.10.227 -- 16327:M 23 Oct 04:39:46.012 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:51.260 # +sdown master rails 10.10.10.227 6379sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:51.342 # +sdown master rails 10.10.10.227 6379sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.764 # +sdown master rails 10.10.10.227 6379sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.823 # +odown master rails 10.10.10.227 6379 #quorum 2/2sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.823 # +new-epoch 3sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.823 # +try-failover master rails 10.10.10.227 6379sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.859 # +vote-for-leader 21af3b909123f02a883c6b42dcbe868b7b449c59 3sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:51.904 # +new-epoch 3sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:51.909 # +vote-for-leader 21af3b909123f02a883c6b42dcbe868b7b449c59 3sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.912 # 10.10.21.109:26379 voted for 21af3b909123f02a883c6b42dcbe868b7b449c59 3sentinel16 - 10.10.10.227 -- 16405:X 23 Oct 04:39:51.959 # +new-epoch 3sentinel16 - 10.10.10.227 -- 16405:X 23 Oct 04:39:51.964 # +vote-for-leader 21af3b909123f02a883c6b42dcbe868b7b449c59 3sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:51.967 # 10.10.10.227:26379 voted for 21af3b909123f02a883c6b42dcbe868b7b449c59 3sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.015 # +elected-leader master rails 10.10.10.227 6379sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.015 # +failover-state-select-slave master rails 10.10.10.227 6379sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.077 # +new-epoch 3sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.092 # +selected-slave slave 10.10.21.109:6379 10.10.21.109 6379 @ rails 10.10.10.227 6379sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.092 * +failover-state-send-slaveof-noone slave 10.10.21.109:6379 10.10.21.109 6379 @ railsredis15 - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.174 # Connection with master lost.redis15 - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.174 * Caching the disconnected master state.redis15 - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.174 * Discarding previously cached master state.sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.175 * +failover-state-wait-promotion slave 10.10.21.109:6379 10.10.21.109 6379 @ railsredis15 - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.184 * MASTER MODE enabled (user request from 'id=12 addr=10.10.30.129:39611 fd=13 name=sentinel-21af3b90-cmd age=494042 idle=0 flags=x db=0 sub=0 psub=0 multi=3 qbuf=0 qbuf-free=32768 obl=36 oll=0 omem=0 events=rw cmd=exec')redis15 - 10.10.21.109 -- 17755:M 23 Oct 04:39:52.195 # CONFIG REWRITE executed with success.sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.407 # +odown master rails 10.10.10.227 6379 #quorum 3/2sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.408 # +new-epoch 4sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.408 # +try-failover master rails 10.10.10.227 6379sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.410 # +vote-for-leader ba42536356be4fe8848b23bdaa8cb1053084aaa5 4sentinel16 - 10.10.10.227 -- 16405:X 23 Oct 04:39:52.420 # +new-epoch 4sentinel16 - 10.10.10.227 -- 16405:X 23 Oct 04:39:52.425 # +vote-for-leader ba42536356be4fe8848b23bdaa8cb1053084aaa5 4sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:52.427 # +new-epoch 4sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.427 # 10.10.10.227:26379 voted for ba42536356be4fe8848b23bdaa8cb1053084aaa5 4sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.434 # +new-epoch 4sentinel72 - 10.10.20.161 -- 14162:X 23 Oct 04:39:52.437 # 10.10.30.129:26379 voted for ba42536356be4fe8848b23bdaa8cb1053084aaa5 4sentinel14 - 10.10.30.129 -- 16404:X 23 Oct 04:39:52.438 # +vote-for-leader ba42536356be4fe8848b23bdaa8cb1053084aaa5 4sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:52.448 # +vote-for-leader ba42536356be4fe8848b23bdaa8cb1053084aaa5 4sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:52.448 # +odown master rails 10.10.10.227 6379 #quorum 2/2sentinel15 - 10.10.21.109 -- 17778:X 23 Oct 04:39:52.448 # Next failover delay: I will not start a failover before Sun Oct 23 04:40:53 2016
So, looking at the combined logs, I get an AOF is taking too long message at 04:39:46 on redis and 5 seconds later (down-after-milliseconds value), the sentinels all go into +sdown and +odown immediately, since they all agree.However, I can confirm that there is nothing going on, on that box, at the time. Not CPU, not IO, nothing. So it appears that the aof fsync is causing the master to show unavailable for greater than 5s. I can either set the down-after-ms time to 10s to (hopefully) avoid this or I can set fsync to no. :-/