Continuously losing master and slave connection

gon...@gmail.com

Apr 22, 2014, 10:39:56 PM

to redi...@googlegroups.com

Hello,

I had to raise repl-backlog-size to 500mb to avoid downtime. I can't understand why the server can send a huge amount of resync data with no problem, yet has trouble answering a PING request within 10 or 60 seconds.

I'm running 16 master servers on 4 EC2 instances, with a slave for each one on two other EC2 instances. Only one of them has been giving me trouble, since 2.8.8.

Am I missing something? How can I tackle this? Could it be related to this issue: https://github.com/antirez/redis/issues/1650 ?

I'm seeing almost 20k operations per Redis instance.

Currently this is the output of INFO memory on the master:
used_memory:14147782848
used_memory_human:13.18G
used_memory_rss:14405611520
used_memory_peak:18898750272
used_memory_peak_human:17.60G
used_memory_lua:33792
mem_fragmentation_ratio:1.02
mem_allocator:jemalloc-3.2.0

CONFIG GET client-output-buffer-limit on both servers:
1) "client-output-buffer-limit"
2) "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60"

We don't use pubsub
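
For anyone reading the CONFIG GET reply above: the value is a flat list of groups, each with a client class, a hard limit, a soft limit, and soft seconds. Here is a minimal Python sketch that splits it up. The names `BufferLimit` and `parse_output_buffer_limits` are made up for illustration and are not part of Redis or any client library.

```python
from typing import Dict, NamedTuple


class BufferLimit(NamedTuple):
    hard_bytes: int
    soft_bytes: int
    soft_seconds: int

    @property
    def unlimited(self) -> bool:
        # A hard and soft limit of 0 means no limit is enforced for that class.
        return self.hard_bytes == 0 and self.soft_bytes == 0


def parse_output_buffer_limits(value: str) -> Dict[str, BufferLimit]:
    """Split '<class> <hard> <soft> <seconds> ...' into one entry per class."""
    tokens = value.split()
    if len(tokens) % 4 != 0:
        raise ValueError("expected groups of: class hard soft seconds")
    limits = {}
    for i in range(0, len(tokens), 4):
        name, hard, soft, seconds = tokens[i:i + 4]
        limits[name] = BufferLimit(int(hard), int(soft), int(seconds))
    return limits


# The reply string quoted in this post.
reply = "normal 0 0 0 slave 0 0 0 pubsub 33554432 8388608 60"
limits = parse_output_buffer_limits(reply)
for name, limit in limits.items():
    print(name, "unlimited" if limit.unlimited else limit)
```

With these values the slave class has no output buffer limit, so that setting is probably not what drops the slave connection.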

Master conf:

port 6300

pidfile /var/run/redis_0.pid

logfile /var/log/redis_0.log

dbfilename dump_0.rdb

appendfilename appendonly_0.aof

daemonize yes

timeout 600

dir /var/lib/redis

databases 16

rdbcompression yes

slave-serve-stale-data yes

requirepass -----------------------------------------

appendonly no

appendfsync no

no-appendfsync-on-rewrite yes

auto-aof-rewrite-percentage 0

auto-aof-rewrite-min-size 64mb

slowlog-log-slower-than 10000

slowlog-max-len 1024

hash-max-ziplist-entries 256

hash-max-ziplist-value 1024

list-max-ziplist-entries 256

list-max-ziplist-value 1024

set-max-intset-entries 512

zset-max-ziplist-entries 256

zset-max-ziplist-value 1024

activerehashing yes

repl-backlog-size 500mb

Log from master:

[115250] 22 Apr 22:49:24.224 # Connection with slave 10.155.196.174:6300 lost.

[115250] 22 Apr 22:49:24.278 * Slave asks for synchronization

[115250] 22 Apr 22:49:24.709 * Partial resynchronization request accepted. Sending 475575343 bytes of backlog starting from offset 3341440037224.

[115250] 22 Apr 22:52:42.252 # Connection with slave 10.155.196.174:6300 lost.

[115250] 22 Apr 22:52:42.336 * Slave asks for synchronization

[115250] 22 Apr 22:52:42.681 * Partial resynchronization request accepted. Sending 484955663 bytes of backlog starting from offset 3342983118227.

[115250] 22 Apr 22:53:47.287 # Connection with slave 10.155.196.174:6300 lost.

[115250] 22 Apr 22:53:47.340 * Slave asks for synchronization

[115250] 22 Apr 22:53:47.664 * Partial resynchronization request accepted. Sending 476326200 bytes of backlog starting from offset 3343499453628.

[115250] 22 Apr 22:54:55.313 # Connection with slave 10.155.196.174:6300 lost.

[115250] 22 Apr 22:54:55.369 * Slave asks for synchronization

[115250] 22 Apr 22:54:55.731 * Partial resynchronization request accepted. Sending 518925163 bytes of backlog starting from offset 3344035506726.
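
To put those numbers next to the 500mb backlog, here is a small Python sketch that pulls the byte counts out of the master log lines above and expresses each as a fraction of the backlog. It assumes "500mb" means 500 * 1024 * 1024 bytes, as described in the comments of the stock redis.conf. The function name `backlog_usage` is only for illustration.

```python
import re

# repl-backlog-size 500mb, assuming the redis.conf convention 1mb = 1024 * 1024 bytes.
BACKLOG_BYTES = 500 * 1024 * 1024

# The four "Partial resynchronization request accepted" lines from the master log.
MASTER_LOG = """\
[115250] 22 Apr 22:49:24.709 * Partial resynchronization request accepted. Sending 475575343 bytes of backlog starting from offset 3341440037224.
[115250] 22 Apr 22:52:42.681 * Partial resynchronization request accepted. Sending 484955663 bytes of backlog starting from offset 3342983118227.
[115250] 22 Apr 22:53:47.664 * Partial resynchronization request accepted. Sending 476326200 bytes of backlog starting from offset 3343499453628.
[115250] 22 Apr 22:54:55.731 * Partial resynchronization request accepted. Sending 518925163 bytes of backlog starting from offset 3344035506726.
"""

PATTERN = re.compile(r"Sending (\d+) bytes of backlog starting from offset (\d+)")


def backlog_usage(log_text, backlog_bytes=BACKLOG_BYTES):
    """Return (offset, bytes_sent, fraction_of_backlog) for each partial resync."""
    rows = []
    for match in PATTERN.finditer(log_text):
        sent = int(match.group(1))
        offset = int(match.group(2))
        rows.append((offset, sent, sent / backlog_bytes))
    return rows


for offset, sent, fraction in backlog_usage(MASTER_LOG):
    print(f"offset {offset}: {sent} bytes sent, {fraction:.1%} of backlog")
```

Under that assumption every partial resync here sends roughly 91% to 99% of the backlog, so the slave is close to falling outside it each time it reconnects.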

Slave conf:

port 6300

pidfile /var/run/redis_0_slave.pid

logfile /var/log/redis_0_slave.log

dbfilename dump_0_slave.rdb

appendfilename appendonly_0_slave.aof

daemonize yes

timeout 600

dir /var/lib/redis

databases 16

rdbcompression yes

slave-serve-stale-data yes

requirepass -------------------------------------------------------

appendonly no

appendfsync no

no-appendfsync-on-rewrite yes

auto-aof-rewrite-percentage 0

auto-aof-rewrite-min-size 64mb

slowlog-log-slower-than 10000

slowlog-max-len 1024

hash-max-ziplist-entries 256

hash-max-ziplist-value 512

list-max-ziplist-entries 256

list-max-ziplist-value 512

set-max-intset-entries 512

zset-max-ziplist-entries 256

zset-max-ziplist-value 512

activerehashing yes

slaveof 10.155.245.132 6300

masterauth -------------------------------------------------------

Log from slave:

[3911] 22 Apr 21:49:41.357 # MASTER timeout: no data nor PING received...

[3911] 22 Apr 21:49:41.357 # Connection with master lost.

[3911] 22 Apr 21:49:41.357 * Caching the disconnected master state.

[3911] 22 Apr 21:49:41.357 * Connecting to MASTER 10.155.245.132:6300

[3911] 22 Apr 21:49:41.357 * MASTER <-> SLAVE sync started

[3911] 22 Apr 21:49:41.359 * Non blocking connect for SYNC fired the event.

[3911] 22 Apr 21:49:41.407 * Master replied to PING, replication can continue...

[3911] 22 Apr 21:49:41.410 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3341440037224).

[3911] 22 Apr 21:49:41.412 * Successful partial resynchronization with master.

[3911] 22 Apr 21:49:41.413 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

[3911] 22 Apr 21:52:59.391 # MASTER timeout: no data nor PING received...

[3911] 22 Apr 21:52:59.391 # Connection with master lost.

[3911] 22 Apr 21:52:59.391 * Caching the disconnected master state.

[3911] 22 Apr 21:52:59.391 * Connecting to MASTER 10.155.245.132:6300

[3911] 22 Apr 21:52:59.391 * MASTER <-> SLAVE sync started

[3911] 22 Apr 21:52:59.392 * Non blocking connect for SYNC fired the event.

[3911] 22 Apr 21:52:59.472 * Master replied to PING, replication can continue...

[3911] 22 Apr 21:52:59.475 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3342983118227).

[3911] 22 Apr 21:52:59.477 * Successful partial resynchronization with master.

[3911] 22 Apr 21:52:59.477 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.

[3911] 22 Apr 21:54:04.425 # MASTER timeout: no data nor PING received...

[3911] 22 Apr 21:54:04.425 # Connection with master lost.

[3911] 22 Apr 21:54:04.425 * Caching the disconnected master state.

[3911] 22 Apr 21:54:04.425 * Connecting to MASTER 10.155.245.132:6300

[3911] 22 Apr 21:54:04.425 * MASTER <-> SLAVE sync started

[3911] 22 Apr 21:54:04.426 * Non blocking connect for SYNC fired the event.

[3911] 22 Apr 21:54:04.474 * Master replied to PING, replication can continue...

[3911] 22 Apr 21:54:04.477 * Trying a partial resynchronization (request f85250bedc4b4492c0b2531dcddd01439ce289fc:3343499453628).

[3911] 22 Apr 21:54:04.479 * Successful partial resynchronization with master.

[3911] 22 Apr 21:54:04.479 * MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.
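
To see how often the slave gives up on the master, here is a short Python sketch that reads the "MASTER timeout" lines from the slave log above and prints the gap between consecutive timeouts. The year is not in the log lines, so 2014 is assumed from the date of this post. The helper names are illustrative only.

```python
from datetime import datetime

# The three timeout lines from the slave log.
SLAVE_LOG = """\
[3911] 22 Apr 21:49:41.357 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:52:59.391 # MASTER timeout: no data nor PING received...
[3911] 22 Apr 21:54:04.425 # MASTER timeout: no data nor PING received...
"""


def timeout_times(log_text, year=2014):
    """Parse the timestamp of every 'MASTER timeout' line (year is an assumption)."""
    times = []
    for line in log_text.splitlines():
        if "MASTER timeout" not in line:
            continue
        # Line shape: "[pid] DD Mon HH:MM:SS.mmm # message"
        parts = line.split()
        stamp = f"{year} {parts[1]} {parts[2]} {parts[3]}"
        times.append(datetime.strptime(stamp, "%Y %d %b %H:%M:%S.%f"))
    return times


def gaps_seconds(times):
    """Seconds elapsed between each pair of consecutive timestamps."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


for gap in gaps_seconds(timeout_times(SLAVE_LOG)):
    print(f"{gap:.3f} s since previous timeout")
```

For the lines shown, the gaps come out at about 198 seconds and 65 seconds.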

Josiah Carlson

Apr 23, 2014, 2:05:50 PM

to redi...@googlegroups.com

Replace your slave VM. I'd wager it's a noisy neighbor (eating resources), a partial hardware failure between the two machines (I have had two VMs that could not communicate with each other but could communicate with every other machine), a partial hardware failure on one machine, or something similar.

- Josiah
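
One rough way to check the link between the two machines before replacing anything is to time a PING from the slave host to the master. The sketch below is only an illustration: it uses a plain socket and Redis's inline command form rather than a client library, `ping_rtt` is a made-up name, and on a server with requirepass set the reply will be an error rather than +PONG unless you authenticate first. The round trip is still measured either way.

```python
import socket
import time


def ping_rtt(host, port, timeout=5.0):
    """Send an inline PING and return (seconds_until_first_reply, reply_bytes)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        start = time.monotonic()
        sock.sendall(b"PING\r\n")   # inline command form, no client library needed
        reply = sock.recv(64)       # first chunk of whatever the server answers
        elapsed = time.monotonic() - start
    if not reply:
        raise ConnectionError("connection closed before any reply")
    return elapsed, reply
```

Calling `ping_rtt("10.155.245.132", 6300)` from the slave host in a loop, using the master address from the slaveof line above, should show whether replies occasionally stall for seconds. A `socket.timeout` here would point at the network or the host rather than at replication itself.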

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/d/optout.

Gonzalo Garcia

Apr 23, 2014, 2:10:07 PM

to redi...@googlegroups.com

Thanks, we are now migrating all our Redis servers to a VPC, in the same placement group and with enhanced networking (access to the network driver). We are also switching to the new R3 instances, which have better RAM bandwidth. They are optimized for Redis.

We've gathered quite a lot of experience running Redis on Amazon EC2 instances.

Altogether we handle almost 1.2 million requests per second at peak hour.
