Replication issue: master_link_status is never up

4,672 views
Skip to first unread message

Alex

unread,
Jan 22, 2015, 5:32:18 AM1/22/15
to redi...@googlegroups.com
Hi guys,

I've got 2 servers with Redis (2.6.17): master and slave. Server 218 is a slave which is connected to master 220. 
Having absolutely fresh setup of slave (all .rdb deleted) I started it and executed 
slaveof 10.xxx.xxx.220 6379


Here is what I have in logs (time on servers is synced):
Slave:
```
[13888] 22 Jan 03:49:12.533 * Connecting to MASTER 10.80.129.220:6379
[13888] 22 Jan 03:49:12.533 * MASTER <-> SLAVE sync started
[13888] 22 Jan 03:49:12.533 * Non blocking connect for SYNC fired the event.
[13888] 22 Jan 03:49:12.533 * Master replied to PING, replication can continue...
[13888] 22 Jan 03:50:18.825 * MASTER <-> SLAVE sync: receiving 3248004619 bytes from master
[14164] 22 Jan 03:50:26.061 * DB saved on disk
[14164] 22 Jan 03:50:26.113 * RDB: 169 MB of memory used by copy-on-write
[13888] 22 Jan 03:50:26.257 * Background saving terminated with success
[13888] 22 Jan 03:51:03.947 * MASTER <-> SLAVE sync: Loading DB in memory
[13888] 22 Jan 03:54:29.243 * MASTER <-> SLAVE sync: Finished with success
[13888] 22 Jan 03:54:29.243 * 50000 changes in 60 seconds. Saving...
[13888] 22 Jan 03:54:29.322 * Background saving started by pid 14185
[13888] 22 Jan 03:54:29.822 * Connecting to MASTER 10.80.129.220:6379
[13888] 22 Jan 03:54:29.822 * MASTER <-> SLAVE sync started
[13888] 22 Jan 03:54:29.822 * Non blocking connect for SYNC fired the event.
[13888] 22 Jan 03:54:29.822 * Master replied to PING, replication can continue...
[13888] 22 Jan 03:55:36.904 * MASTER <-> SLAVE sync: receiving 3247525339 bytes from master
[14185] 22 Jan 03:55:43.456 * DB saved on disk
[14185] 22 Jan 03:55:43.512 * RDB: 140 MB of memory used by copy-on-write
[13888] 22 Jan 03:55:43.633 * Background saving terminated with success
[13888] 22 Jan 03:56:21.720 * MASTER <-> SLAVE sync: Loading DB in memory
[13888] 22 Jan 03:59:48.779 * MASTER <-> SLAVE sync: Finished with success
[13888] 22 Jan 03:59:49.580 * Connecting to MASTER 10.80.129.220:6379
[13888] 22 Jan 03:59:49.580 * MASTER <-> SLAVE sync started
[13888] 22 Jan 03:59:49.580 * Non blocking connect for SYNC fired the event.
[13888] 22 Jan 03:59:49.580 * Master replied to PING, replication can continue...
[13888] 22 Jan 04:00:56.123 * MASTER <-> SLAVE sync: receiving 3247039902 bytes from master
[13888] 22 Jan 04:01:22.030 - Closing idle client
[13888] 22 Jan 04:01:36.456 * MASTER <-> SLAVE sync: Loading DB in memory
[13888] 22 Jan 04:05:02.053 * MASTER <-> SLAVE sync: Finished with success
[13888] 22 Jan 04:05:02.554 * Connecting to MASTER 10.80.129.220:6379
[13888] 22 Jan 04:05:02.554 * MASTER <-> SLAVE sync started
[13888] 22 Jan 04:05:02.554 * Non blocking connect for SYNC fired the event.
[13888] 22 Jan 04:05:02.554 * Master replied to PING, replication can continue...
[13888] 22 Jan 04:06:09.662 * MASTER <-> SLAVE sync: receiving 3246530094 bytes from master
[13888] 22 Jan 04:06:32.101 - Closing idle client
[13888] 22 Jan 04:06:50.065 * MASTER <-> SLAVE sync: Loading DB in memory
[13888] 22 Jan 04:10:15.889 * MASTER <-> SLAVE sync: Finished with success
[13888] 22 Jan 04:10:16.690 * Connecting to MASTER 10.80.129.220:6379
[13888] 22 Jan 04:10:16.690 * MASTER <-> SLAVE sync started
[13888] 22 Jan 04:10:16.690 * Non blocking connect for SYNC fired the event.
[13888] 22 Jan 04:10:16.690 * Master replied to PING, replication can continue...
```

Master:
```
[6027] 22 Jan 03:49:12.529 * Slave ask for synchronization
[6027] 22 Jan 03:49:12.529 * Starting BGSAVE for SYNC
[6027] 22 Jan 03:49:12.559 * Background saving started by pid 28433
[28433] 22 Jan 03:50:18.328 * DB saved on disk
[28433] 22 Jan 03:50:18.346 * RDB: 3645 MB of memory used by copy-on-write
[6027] 22 Jan 03:50:18.474 * Background saving terminated with success
[6027] 22 Jan 03:51:03.550 * Synchronization with slave succeeded
[6027] 22 Jan 03:52:39.088 # Client addr=10.xx.xx.218:38275 fd=77 name= age=207 idle=96 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4137 omem=99300968 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.
[6027] 22 Jan 03:54:29.819 * Slave ask for synchronization
[6027] 22 Jan 03:54:29.819 * Starting BGSAVE for SYNC
[6027] 22 Jan 03:54:29.849 * Background saving started by pid 28670
[28670] 22 Jan 03:55:36.792 * DB saved on disk
[28670] 22 Jan 03:55:36.811 * RDB: 3679 MB of memory used by copy-on-write
[6027] 22 Jan 03:55:36.900 * Background saving terminated with success
[6027] 22 Jan 03:56:21.316 * Synchronization with slave succeeded
[6027] 22 Jan 03:57:56.056 # Client addr=10.xx.xx.218:38313 fd=113 name= age=207 idle=95 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4095 omem=98324440 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.
[6027] 22 Jan 03:59:49.577 * Slave ask for synchronization
[6027] 22 Jan 03:59:49.577 * Starting BGSAVE for SYNC
[6027] 22 Jan 03:59:49.607 * Background saving started by pid 28933
[28933] 22 Jan 04:00:56.002 * DB saved on disk
[28933] 22 Jan 04:00:56.021 * RDB: 3470 MB of memory used by copy-on-write
[6027] 22 Jan 04:00:56.119 * Background saving terminated with success
[6027] 22 Jan 04:01:36.052 * Synchronization with slave succeeded
[6027] 22 Jan 04:03:19.029 # Client addr=10.xx.xx.218:38346 fd=59 name= age=210 idle=103 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4098 omem=98041936 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.
[6027] 22 Jan 04:05:02.551 * Slave ask for synchronization
[6027] 22 Jan 04:05:02.551 * Starting BGSAVE for SYNC
[6027] 22 Jan 04:05:02.584 * Background saving started by pid 29120
[29120] 22 Jan 04:06:09.513 * DB saved on disk
[29120] 22 Jan 04:06:09.532 * RDB: 3659 MB of memory used by copy-on-write
[6027] 22 Jan 04:06:09.640 * Background saving terminated with success
[6027] 22 Jan 04:06:49.673 * Synchronization with slave succeeded
[6027] 22 Jan 04:08:31.083 # Client addr=10.xx.xx.218:38380 fd=211 name= age=209 idle=102 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4095 omem=98246616 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.
[6027] 22 Jan 04:10:16.688 * Slave ask for synchronization
[6027] 22 Jan 04:10:16.688 * Starting BGSAVE for SYNC
[6027] 22 Jan 04:10:16.719 * Background saving started by pid 29323
[29323] 22 Jan 04:11:23.748 * DB saved on disk
[29323] 22 Jan 04:11:23.767 * RDB: 3631 MB of memory used by copy-on-write
[6027] 22 Jan 04:11:23.850 * Background saving terminated with success
[6027] 22 Jan 04:12:03.681 * Synchronization with slave succeeded
[6027] 22 Jan 04:13:46.009 # Client addr=10.xx.xx.218:38413 fd=499 name= age=210 idle=103 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4102 omem=98005232 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.
```

INFO on slave keep saying:
```
# Replication
role:slave
master_host:10.80.129.220
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:1
master_sync_left_bytes:714628316
master_sync_last_io_seconds_ago:0
master_link_down_since_seconds:100
slave_priority:100
slave_read_only:1
connected_slaves:0
```
or sometimes INFO stuck for a minute and produce the following:
```
# Replication
role:slave
master_host:10.80.129.220
master_port:6379
master_link_status:down
master_last_io_seconds_ago:-1
master_sync_in_progress:1
master_sync_left_bytes:0
master_sync_last_io_seconds_ago:0
master_link_down_since_seconds:108
slave_priority:100
slave_read_only:1
connected_slaves:0
```

By running INFO on slave multiple times I found out that amount of keys is going from 0 to real master's amount and than it starts again from 0.
Is it related to Master's message "Client ... scheduled to be closed ASAP for overcoming of output buffer limits"? According to Slave's log - things are simply perfect, but replica is never up...

Also, running INFO on master produces different message from time to time:
```
# Replication
role:master
connected_slaves:0
```
or
```
# Replication
role:master
connected_slaves:1
slave0:10.80.129.218,6379,send_bulk
```

Some info about master:
```
# Memory
used_memory:8608876648
used_memory_human:8.02G
used_memory_rss:8940982272
used_memory_peak:8710743400
used_memory_peak_human:8.11G
used_memory_lua:31744
mem_fragmentation_ratio:1.04
mem_allocator:jemalloc-3.2.0
```

Slave had default SAVE settings, but I disabled them during my tests. Both servers have the same SAVE configuration:

CONFIG GET save
1) "save"
2) ""

Any ideas?

Thanks in advance,
Alex

Josiah Carlson

unread,
Jan 22, 2015, 5:51:04 PM1/22/15
to redi...@googlegroups.com
While loading the RDB from the master, your master attempts to write in total about 100 megs worth of backlogged writes to the slave over 1.5 minutes. Because your slave isn't reading that data, being blocked loading the just-transferred snapshot for 2.5 minutes, your master thinks that it isn't able to keep up, so it closes the connection. Which forces your slave to reconnect. Lather, rinse, repeat.

You should look at your settings for "client-output-buffer-limit slave". If you're seeing disconnects after ~90 seconds and 100 megs, maybe set the soft limit to be 200 megs and 180 seconds, hard limit at 400 megs.

 - Josiah


--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/d/optout.

Alex

unread,
Jan 23, 2015, 3:37:51 AM1/23/15
to redi...@googlegroups.com
Josiah,

Yep, you are right. I actually found the solution faster than my post was approved to appear here :) Increasing buffers on master to 512Mb solved the issue. 
Well, there is still an issue - logs aren't really informative. I would expect some message on slave saying about reject. And a message on master about full buffer and the fact of disconnect. Currently there is only 1 message saying "Client ... scheduled to be closed ASAP for overcoming of output buffer limits". 

Maybe someone will find it useful. 

Alex
Reply all
Reply to author
Forward
0 new messages