I had to raise the backlog (repl-backlog-size) to 500mb in order to avoid downtime. I can't understand why the server can send a huge amount of resync data with no problem, yet has trouble answering a PING request within 10 or even 60 seconds.
I'm running 16 master servers on 4 EC2 instances, with a slave for each one on two other EC2 instances. I'm only having trouble with one of them, since 2.8.8.
I'm seeing almost 20k operations per Redis instance.
Currently this is the INFO memory output on the master:
used_memory:14147782848
used_memory_human:13.18G
used_memory_rss:14405611520
used_memory_peak:18898750272
used_memory_peak_human:17.60G
used_memory_lua:33792
mem_fragmentation_ratio:1.02
mem_allocator:jemalloc-3.2.0
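As a quick sanity check on those numbers, here is a small sketch that recomputes the human-readable sizes and the fragmentation ratio from the raw byte counts. It assumes mem_fragmentation_ratio is simply used_memory_rss divided by used_memory; the variable names are mine, not Redis output.

```python
# Values copied from the INFO memory output above.
used_memory = 14147782848
used_memory_rss = 14405611520
used_memory_peak = 18898750272

GIB = 1024 ** 3

# Assumption: the reported ratio is RSS divided by used_memory.
fragmentation_ratio = used_memory_rss / used_memory

print(f"used: {used_memory / GIB:.2f} GiB")
print(f"peak: {used_memory_peak / GIB:.2f} GiB")
print(f"fragmentation ratio: {fragmentation_ratio:.2f}")
```

The ratio is close to 1, so fragmentation does not look like the problem here.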
Output of CONFIG GET client-output-buffer-limit on both servers:
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60"
We don't use pubsub.
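To make that reply easier to read, here is a minimal sketch that splits it into one entry per client class. The helper name parse_output_buffer_limits is hypothetical; the only thing it relies on is that each class is followed by a hard limit, a soft limit, and the soft-limit seconds, and that a limit of 0 means the limit is disabled.

```python
def parse_output_buffer_limits(value):
    """Split the CONFIG GET reply into {class: (hard, soft, soft_seconds)}."""
    tokens = value.split()
    limits = {}
    # Each class occupies four tokens: name, hard limit, soft limit, soft seconds.
    for i in range(0, len(tokens), 4):
        name, hard, soft, seconds = tokens[i:i + 4]
        limits[name] = (int(hard), int(soft), int(seconds))
    return limits


# Reply copied from the CONFIG GET output above.
reply = "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60"
limits = parse_output_buffer_limits(reply)

for name, (hard, soft, seconds) in limits.items():
    state = "disabled" if hard == 0 and soft == 0 else "enabled"
    print(f"{name}: hard={hard} soft={soft} soft_seconds={seconds} ({state})")
```

So the slave class has no output buffer limit on either server, and the master should not be dropping the slave because of it.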
Master conf:
port 6300
pidfile /var/run/redis_0.pid
logfile /var/log/redis_0.log
dbfilename dump_0.rdb
appendfilename appendonly_0.aof
daemonize yes
timeout 600
dir /var/lib/redis
databases 16
rdbcompression yes
slave-serve-stale-data yes
requirepass -----------------------------------------
appendonly no
appendfsync no
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 0
auto-aof-rewrite-min-size 64mb
slowlog-log-slower-than 10000
slowlog-max-len 1024
hash-max-ziplist-entries 256
hash-max-ziplist-value 1024
list-max-ziplist-entries 256
list-max-ziplist-value 1024
set-max-intset-entries 512
zset-max-ziplist-entries 256
zset-max-ziplist-value 1024
activerehashing yes
repl-backlog-size 500mb
Log from master:
[115250] 22 Apr 22:49:24.278 * Slave asks for synchronization
[115250] 22 Apr 22:49:24.709 * Partial resynchronization request accepted. Sending 475575343 bytes of backlog starting from offset 3341440037224.
[115250] 22 Apr 22:52:42.336 * Slave asks for synchronization
[115250] 22 Apr 22:52:42.681 * Partial resynchronization request accepted. Sending 484955663 bytes of backlog starting from offset 3342983118227.
[115250] 22 Apr 22:53:47.340 * Slave asks for synchronization
[115250] 22 Apr 22:53:47.664 * Partial resynchronization request accepted. Sending 476326200 bytes of backlog starting from offset 3343499453628.
[115250] 22 Apr 22:54:55.369 * Slave asks for synchronization
[115250] 22 Apr 22:54:55.731 * Partial resynchronization request accepted. Sending 518925163 bytes of backlog starting from offset 3344035506726.
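A rough way to put numbers on the master log above: the sketch below estimates how fast the replication stream grows and how many seconds of writes fit in the backlog. It assumes that the master's replication offset at each accepted request is roughly the requested offset plus the bytes sent, and that 500mb means 500 * 1024 * 1024 bytes. The names (PATTERN, parse_events, BACKLOG_BYTES) are mine.

```python
import re
from datetime import datetime

# The four "accepted" lines copied from the master log above.
MASTER_LOG = """\
[115250] 22 Apr 22:49:24.709 * Partial resynchronization request accepted. Sending 475575343 bytes of backlog starting from offset 3341440037224.
[115250] 22 Apr 22:52:42.681 * Partial resynchronization request accepted. Sending 484955663 bytes of backlog starting from offset 3342983118227.
[115250] 22 Apr 22:53:47.664 * Partial resynchronization request accepted. Sending 476326200 bytes of backlog starting from offset 3343499453628.
[115250] 22 Apr 22:54:55.731 * Partial resynchronization request accepted. Sending 518925163 bytes of backlog starting from offset 3344035506726.
"""

PATTERN = re.compile(
    r"\] (\d+ \w+ \d+:\d+:\d+\.\d+) \* Partial resynchronization request accepted\. "
    r"Sending (\d+) bytes of backlog starting from offset (\d+)\."
)

BACKLOG_BYTES = 500 * 1024 * 1024  # repl-backlog-size 500mb


def parse_events(log):
    """Return (timestamp, bytes_sent, start_offset) for each accepted request."""
    events = []
    for when, sent, offset in PATTERN.findall(log):
        # The log has no year; only differences between timestamps are used.
        ts = datetime.strptime(when, "%d %b %H:%M:%S.%f")
        events.append((ts, int(sent), int(offset)))
    return events


events = parse_events(MASTER_LOG)
rates = []
for (t0, sent0, off0), (t1, sent1, off1) in zip(events, events[1:]):
    # Assumption: master offset at the event ~= requested offset + bytes sent.
    head0, head1 = off0 + sent0, off1 + sent1
    seconds = (t1 - t0).total_seconds()
    rate = (head1 - head0) / seconds
    rates.append(rate)
    print(f"{rate / 1e6:.2f} MB/s, backlog covers ~{BACKLOG_BYTES / rate:.0f} s")
```

By this estimate the stream grows at roughly 8 MB/s, so a 500mb backlog holds only about a minute of writes, and the last request (518925163 bytes) is already close to the full backlog size. This is a back-of-the-envelope figure from four log lines, not a measurement.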
Slave conf:
port 6300
pidfile /var/run/redis_0_slave.pid
logfile /var/log/redis_0_slave.log
dbfilename dump_0_slave.rdb
appendfilename appendonly_0_slave.aof
daemonize yes
timeout 600
dir /var/lib/redis
databases 16
rdbcompression yes
slave-serve-stale-data yes
requirepass -------------------------------------------------------
appendonly no
appendfsync no
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 0
auto-aof-rewrite-min-size 64mb
slowlog-log-slower-than 10000
slowlog-max-len 1024
hash-max-ziplist-entries 256
hash-max-ziplist-value 512
list-max-ziplist-entries 256
list-max-ziplist-value 512
set-max-intset-entries 512
zset-max-ziplist-entries 256
zset-max-ziplist-value 512
activerehashing yes
slaveof 10.155.245.132 6300
masterauth -------------------------------------------------------
Log from slave:
[3911] 22 Apr 21:49:41.357 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:49:41.357 # Connection with master lost.
[3911] 22 Apr 21:49:41.357 * Caching the disconnected master state.
[3911] 22 Apr 21:49:41.357 * MASTER <-> SLAVE sync started
[3911] 22 Apr 21:49:41.359 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:49:41.407 * Master replied to PING, replication can continue...
[3911] 22 Apr 21:49:41.410 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3341440037224).
[3911] 22 Apr 21:49:41.412 * Successful partial resynchronization with master.
[3911] 22 Apr 21:49:41.413 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
[3911] 22 Apr 21:52:59.391 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:52:59.391 # Connection with master lost.
[3911] 22 Apr 21:52:59.391 * Caching the disconnected master state.
[3911] 22 Apr 21:52:59.391 * MASTER <-> SLAVE sync started
[3911] 22 Apr 21:52:59.392 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:52:59.472 * Master replied to PING, replication can continue...
[3911] 22 Apr 21:52:59.475 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3342983118227).
[3911] 22 Apr 21:52:59.477 * Successful partial resynchronization with master.
[3911] 22 Apr 21:52:59.477 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
[3911] 22 Apr 21:54:04.425 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:54:04.425 # Connection with master lost.
[3911] 22 Apr 21:54:04.425 * Caching the disconnected master state.
[3911] 22 Apr 21:54:04.425 * MASTER <-> SLAVE sync started
[3911] 22 Apr 21:54:04.426 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:54:04.474 * Master replied to PING, replication can continue...
[3911] 22 Apr 21:54:04.477 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3343499453628).
[3911] 22 Apr 21:54:04.479 * Successful partial resynchronization with master.
[3911] 22 Apr 21:54:04.479 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
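To show what puzzles me in the slave log, here is a small sketch that measures two things from the lines above: how long the master takes to answer the PING right after each reconnect, and how far apart the timeouts are. Only the relevant lines are copied in; the parsing helper and variable names are mine.

```python
import re
from datetime import datetime, timedelta

# Timeout, connect, and PING-reply lines copied from the slave log above.
SLAVE_LOG = """\
[3911] 22 Apr 21:49:41.357 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:49:41.359 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:49:41.407 * Master replied to PING, replication can continue...
[3911] 22 Apr 21:52:59.391 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:52:59.392 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:52:59.472 * Master replied to PING, replication can continue...
[3911] 22 Apr 21:54:04.425 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:54:04.426 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:54:04.474 * Master replied to PING, replication can continue...
"""

LINE = re.compile(r"\] (\d+ \w+ \d+:\d+:\d+\.\d+) [#*] (.*)")


def parse(log):
    """Return (timestamp, message) pairs for every recognizable log line."""
    entries = []
    for line in log.splitlines():
        match = LINE.search(line)
        if match:
            ts = datetime.strptime(match.group(1), "%d %b %H:%M:%S.%f")
            entries.append((ts, match.group(2)))
    return entries


entries = parse(SLAVE_LOG)
timeouts = [ts for ts, msg in entries if msg.startswith("MASTER timeout")]
connects = [ts for ts, msg in entries if msg.startswith("Non blocking connect")]
pings = [ts for ts, msg in entries if msg.startswith("Master replied to PING")]

# Milliseconds between the reconnect and the master's PING reply.
ping_latency_ms = [(p - c) // timedelta(milliseconds=1) for c, p in zip(connects, pings)]
# Whole seconds between consecutive timeouts.
timeout_gaps_s = [round((b - a).total_seconds()) for a, b in zip(timeouts, timeouts[1:])]

print("PING reply after reconnect (ms):", ping_latency_ms)
print("seconds between timeouts:", timeout_gaps_s)
```

On a fresh connection the master answers the PING in well under 100 ms each time, yet on the established replication link the slave still hits the timeout again after a minute or a few minutes.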