Continuously losing master and slave connection


gon...@gmail.com

Apr 22, 2014, 10:39:56 PM
to redi...@googlegroups.com
Hello, 
I had to raise repl-backlog-size to 500mb in order to avoid downtime. I can't understand why the master sends a huge amount of resync data with no problem, but has trouble answering a PING request within 10 seconds, or even 60.
I'm running 16 master servers on 4 EC2 instances, with a slave for each one on two other EC2 instances. Only one of them has been having trouble, since 2.8.8.

Am I missing something? How can I tackle this? Could it be related to this issue: https://github.com/antirez/redis/issues/1650 ?
I'm seeing almost 20k operations per Redis instance.

Currently this is the INFO memory output on the master:
used_memory:14147782848
used_memory_human:13.18G
used_memory_rss:14405611520
used_memory_peak:18898750272
used_memory_peak_human:17.60G
used_memory_lua:33792
mem_fragmentation_ratio:1.02
mem_allocator:jemalloc-3.2.0

CONFIG GET client-output-buffer-limit on both servers:
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60"

We don't use pubsub
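
For readers unfamiliar with this setting, here is a small sketch of how that value can be read. In redis.conf the directive takes four fields per client class (class name, hard limit, soft limit, soft seconds), and a limit of 0 disables it, so "slave 0 0 0" should mean the master is not dropping the slave for exceeding an output buffer limit. The names BufferLimit and parse_output_buffer_limits are made up for this illustration.

```python
from typing import Dict, NamedTuple


class BufferLimit(NamedTuple):
    hard_bytes: int    # close the client as soon as the buffer exceeds this (0 = disabled)
    soft_bytes: int    # close the client if above this for soft_seconds (0 = disabled)
    soft_seconds: int


def parse_output_buffer_limits(value: str) -> Dict[str, BufferLimit]:
    """Parse the string returned by CONFIG GET client-output-buffer-limit."""
    tokens = value.split()
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of: class hard soft seconds")
    limits = {}
    for i in range(0, len(tokens), 4):
        name, hard, soft, seconds = tokens[i:i + 4]
        limits[name] = BufferLimit(int(hard), int(soft), int(seconds))
    return limits


# Value copied from the CONFIG GET output above.
limits = parse_output_buffer_limits(
    "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60"
)
for name, limit in limits.items():
    unlimited = limit.hard_bytes == 0 and limit.soft_bytes == 0
    print(name, limit, "(no limit)" if unlimited else "")
```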


Master conf:

port 6300
pidfile /var/run/redis_0.pid
logfile /var/log/redis_0.log
dbfilename dump_0.rdb
appendfilename appendonly_0.aof
daemonize yes
timeout 600
dir /var/lib/redis
databases 16
rdbcompression yes
slave-serve-stale-data yes
requirepass -----------------------------------------
appendonly no
appendfsync no
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 0
auto-aof-rewrite-min-size 64mb
slowlog-log-slower-than 10000
slowlog-max-len 1024
hash-max-ziplist-entries 256
hash-max-ziplist-value 1024
list-max-ziplist-entries 256
list-max-ziplist-value 1024
set-max-intset-entries 512
zset-max-ziplist-entries 256
zset-max-ziplist-value 1024
activerehashing yes
repl-backlog-size 500mb

Log from master:

[115250] 22 Apr 22:49:24.224 # Connection with slave 10.155.196.174:6300 lost.
[115250] 22 Apr 22:49:24.278 * Slave asks for synchronization
[115250] 22 Apr 22:49:24.709 * Partial resynchronization request accepted. Sending 475575343 bytes of backlog starting from offset 3341440037224.
[115250] 22 Apr 22:52:42.252 # Connection with slave 10.155.196.174:6300 lost.
[115250] 22 Apr 22:52:42.336 * Slave asks for synchronization
[115250] 22 Apr 22:52:42.681 * Partial resynchronization request accepted. Sending 484955663 bytes of backlog starting from offset 3342983118227.
[115250] 22 Apr 22:53:47.287 # Connection with slave 10.155.196.174:6300 lost.
[115250] 22 Apr 22:53:47.340 * Slave asks for synchronization
[115250] 22 Apr 22:53:47.664 * Partial resynchronization request accepted. Sending 476326200 bytes of backlog starting from offset 3343499453628.
[115250] 22 Apr 22:54:55.313 # Connection with slave 10.155.196.174:6300 lost.
[115250] 22 Apr 22:54:55.369 * Slave asks for synchronization
[115250] 22 Apr 22:54:55.731 * Partial resynchronization request accepted. Sending 518925163 bytes of backlog starting from offset 3344035506726.
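
The numbers in these log lines can be turned into a rough estimate of the replication stream rate and of how much of the backlog each resync uses. The sketch below is only an approximation under stated assumptions: it treats the difference between two consecutive requested offsets as the bytes the slave consumed in that interval, takes the year 2014 from the post date (the log has none), and reads "500mb" as 500 * 1024 * 1024 bytes as redis.conf documents. The function names are hypothetical.

```python
import re
from datetime import datetime

# The four "accepted" lines copied from the master log above.
MASTER_LOG = """\
[115250] 22 Apr 22:49:24.709 * Partial resynchronization request accepted. Sending 475575343 bytes of backlog starting from offset 3341440037224.
[115250] 22 Apr 22:52:42.681 * Partial resynchronization request accepted. Sending 484955663 bytes of backlog starting from offset 3342983118227.
[115250] 22 Apr 22:53:47.664 * Partial resynchronization request accepted. Sending 476326200 bytes of backlog starting from offset 3343499453628.
[115250] 22 Apr 22:54:55.731 * Partial resynchronization request accepted. Sending 518925163 bytes of backlog starting from offset 3344035506726.
"""

PATTERN = re.compile(
    r"\] (\d+ \w+ \d+:\d+:\d+\.\d+) \* Partial resynchronization request accepted\. "
    r"Sending (\d+) bytes of backlog starting from offset (\d+)\."
)

BACKLOG_BYTES = 500 * 1024 * 1024  # repl-backlog-size 500mb


def parse_resyncs(log_text):
    """Return (timestamp, bytes_sent, start_offset) for each accepted partial resync."""
    events = []
    for match in PATTERN.finditer(log_text):
        stamp, sent, offset = match.groups()
        # Assumption: the year is 2014, taken from the date of the post.
        when = datetime.strptime("2014 " + stamp, "%Y %d %b %H:%M:%S.%f")
        events.append((when, int(sent), int(offset)))
    return events


def stream_rates(events):
    """Approximate bytes/second consumed by the slave between consecutive resyncs."""
    rates = []
    for (t0, _, off0), (t1, _, off1) in zip(events, events[1:]):
        rates.append((off1 - off0) / (t1 - t0).total_seconds())
    return rates


events = parse_resyncs(MASTER_LOG)
rates = stream_rates(events)
for rate in rates:
    print("approx. stream rate: %.2f MB/s" % (rate / 1e6))
for when, sent, _ in events:
    print(when.time(), "resync used %.1f%% of the backlog" % (100.0 * sent / BACKLOG_BYTES))
```

If this reading is right, the stream runs at roughly 8 MB/s, so a 500mb backlog covers only about a minute of writes, and the last resync shown already used most of it.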


Slave conf:

port 6300
pidfile /var/run/redis_0_slave.pid
logfile /var/log/redis_0_slave.log
dbfilename dump_0_slave.rdb
appendfilename appendonly_0_slave.aof
daemonize yes
timeout 600
dir /var/lib/redis
databases 16
rdbcompression yes
slave-serve-stale-data yes
requirepass -------------------------------------------------------
appendonly no
appendfsync no
no-appendfsync-on-rewrite yes
auto-aof-rewrite-percentage 0
auto-aof-rewrite-min-size 64mb
slowlog-log-slower-than 10000
slowlog-max-len 1024
hash-max-ziplist-entries 256
hash-max-ziplist-value 512
list-max-ziplist-entries 256
list-max-ziplist-value 512
set-max-intset-entries 512
zset-max-ziplist-entries 256
zset-max-ziplist-value 512
activerehashing yes
slaveof 10.155.245.132 6300
masterauth -------------------------------------------------------
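
Since the two files are almost identical, a quick diff makes the differences easier to spot. This is a minimal sketch: parse_conf and diff_conf are hypothetical helpers, and only a subset of the directives posted above is included to keep it short.

```python
def parse_conf(text):
    """Map each directive to its arguments; blank lines and # comments are skipped.

    Note: a directive that appears more than once keeps only its last value.
    """
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, args = line.partition(" ")
        conf[name] = args.strip()
    return conf


def diff_conf(a, b):
    """Return {directive: (value_in_a, value_in_b)} where they differ (None = absent)."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }


# Subset of the master and slave configuration shown in this post.
MASTER_CONF = """\
port 6300
timeout 600
hash-max-ziplist-value 1024
list-max-ziplist-value 1024
zset-max-ziplist-value 1024
repl-backlog-size 500mb
"""

SLAVE_CONF = """\
port 6300
timeout 600
hash-max-ziplist-value 512
list-max-ziplist-value 512
zset-max-ziplist-value 512
slaveof 10.155.245.132 6300
"""

diff = diff_conf(parse_conf(MASTER_CONF), parse_conf(SLAVE_CONF))
for name, (master_value, slave_value) in diff.items():
    print("%-26s master=%s slave=%s" % (name, master_value, slave_value))
```

Apart from the expected slaveof and repl-backlog-size lines, the only differences are the ziplist value thresholds, which control how small hashes, lists and sorted sets are encoded in memory, so they are probably not related to the timeouts.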


Log from slave:

[3911] 22 Apr 21:49:41.357 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:49:41.357 # Connection with master lost.
[3911] 22 Apr 21:49:41.357 * Caching the disconnected master state.
[3911] 22 Apr 21:49:41.357 * Connecting to MASTER 10.155.245.132:6300
[3911] 22 Apr 21:49:41.357 * MASTER <-> SLAVE sync started
[3911] 22 Apr 21:49:41.359 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:49:41.407 * Master replied to PING, replication can continue...
[3911] 22 Apr 21:49:41.410 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3341440037224).
[3911] 22 Apr 21:49:41.412 * Successful partial resynchronization with master.
[3911] 22 Apr 21:49:41.413 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
[3911] 22 Apr 21:52:59.391 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:52:59.391 # Connection with master lost.
[3911] 22 Apr 21:52:59.391 * Caching the disconnected master state.
[3911] 22 Apr 21:52:59.391 * Connecting to MASTER 10.155.245.132:6300
[3911] 22 Apr 21:52:59.391 * MASTER <-> SLAVE sync started
[3911] 22 Apr 21:52:59.392 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:52:59.472 * Master replied to PING, replication can continue...
[3911] 22 Apr 21:52:59.475 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3342983118227).
[3911] 22 Apr 21:52:59.477 * Successful partial resynchronization with master.
[3911] 22 Apr 21:52:59.477 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
[3911] 22 Apr 21:54:04.425 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:54:04.425 # Connection with master lost.
[3911] 22 Apr 21:54:04.425 * Caching the disconnected master state.
[3911] 22 Apr 21:54:04.425 * Connecting to MASTER 10.155.245.132:6300
[3911] 22 Apr 21:54:04.425 * MASTER <-> SLAVE sync started
[3911] 22 Apr 21:54:04.426 * Non blocking connect for SYNC fired the event.
[3911] 22 Apr 21:54:04.474 * Master replied to PING, replication can continue...
[3911] 22 Apr 21:54:04.477 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3343499453628).
[3911] 22 Apr 21:54:04.479 * Successful partial resynchronization with master.
[3911] 22 Apr 21:54:04.479 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

Josiah Carlson

Apr 23, 2014, 2:05:50 PM
to redi...@googlegroups.com
Replace your slave VM. I'd wager it's a noisy neighbor (eating resources), a partial hardware failure between the two machines (I have had two VMs that could not communicate with each other but could communicate with every other machine), a partial hardware failure on one machine, or something similar.

 - Josiah


--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/d/optout.

Gonzalo Garcia

Apr 23, 2014, 2:10:07 PM
to redi...@googlegroups.com
Thanks. We are now migrating all our Redis instances to a VPC, in the same placement group and with enhanced networking (access to the network driver). We are also changing to the new R3 instances, which have better RAM bandwidth. They are optimized for Redis.

We've gathered quite a lot of experience running Redis on Amazon EC2 instances.
Altogether we handle almost 1.2 million requests per second at peak hour.

