Replication issue: master_link_status is never up

4,692 views

Skip to first unread message

Alex

unread,

Jan 22, 2015, 5:32:18 AM1/22/15

to redi...@googlegroups.com

Hi guys,

I've got 2 servers with Redis (2.6.17): master and slave. Server 218 is a slave which is connected to master 220.

Having absolutely fresh setup of slave (all .rdb deleted) I started it and executed

slaveof 10.xxx.xxx.220 6379

Here is what I have in logs (time on servers is synced):

Slave:

```

[13888] 22 Jan 03:49:12.533 * Connecting to MASTER 10.80.129.220:6379

[13888] 22 Jan 03:49:12.533 * MASTER <-> SLAVE sync started

[13888] 22 Jan 03:49:12.533 * Non blocking connect for SYNC fired the event.

[13888] 22 Jan 03:49:12.533 * Master replied to PING, replication can continue...

[13888] 22 Jan 03:50:18.825 * MASTER <-> SLAVE sync: receiving 3248004619 bytes from master

[14164] 22 Jan 03:50:26.061 * DB saved on disk

[14164] 22 Jan 03:50:26.113 * RDB: 169 MB of memory used by copy-on-write

[13888] 22 Jan 03:50:26.257 * Background saving terminated with success

[13888] 22 Jan 03:51:03.947 * MASTER <-> SLAVE sync: Loading DB in memory

[13888] 22 Jan 03:54:29.243 * MASTER <-> SLAVE sync: Finished with success

[13888] 22 Jan 03:54:29.243 * 50000 changes in 60 seconds. Saving...

[13888] 22 Jan 03:54:29.322 * Background saving started by pid 14185

[13888] 22 Jan 03:54:29.822 * Connecting to MASTER 10.80.129.220:6379

[13888] 22 Jan 03:54:29.822 * MASTER <-> SLAVE sync started

[13888] 22 Jan 03:54:29.822 * Non blocking connect for SYNC fired the event.

[13888] 22 Jan 03:54:29.822 * Master replied to PING, replication can continue...

[13888] 22 Jan 03:55:36.904 * MASTER <-> SLAVE sync: receiving 3247525339 bytes from master

[14185] 22 Jan 03:55:43.456 * DB saved on disk

[14185] 22 Jan 03:55:43.512 * RDB: 140 MB of memory used by copy-on-write

[13888] 22 Jan 03:55:43.633 * Background saving terminated with success

[13888] 22 Jan 03:56:21.720 * MASTER <-> SLAVE sync: Loading DB in memory

[13888] 22 Jan 03:59:48.779 * MASTER <-> SLAVE sync: Finished with success

[13888] 22 Jan 03:59:49.580 * Connecting to MASTER 10.80.129.220:6379

[13888] 22 Jan 03:59:49.580 * MASTER <-> SLAVE sync started

[13888] 22 Jan 03:59:49.580 * Non blocking connect for SYNC fired the event.

[13888] 22 Jan 03:59:49.580 * Master replied to PING, replication can continue...

[13888] 22 Jan 04:00:56.123 * MASTER <-> SLAVE sync: receiving 3247039902 bytes from master

[13888] 22 Jan 04:01:22.030 - Closing idle client

[13888] 22 Jan 04:01:36.456 * MASTER <-> SLAVE sync: Loading DB in memory

[13888] 22 Jan 04:05:02.053 * MASTER <-> SLAVE sync: Finished with success

[13888] 22 Jan 04:05:02.554 * Connecting to MASTER 10.80.129.220:6379

[13888] 22 Jan 04:05:02.554 * MASTER <-> SLAVE sync started

[13888] 22 Jan 04:05:02.554 * Non blocking connect for SYNC fired the event.

[13888] 22 Jan 04:05:02.554 * Master replied to PING, replication can continue...

[13888] 22 Jan 04:06:09.662 * MASTER <-> SLAVE sync: receiving 3246530094 bytes from master

[13888] 22 Jan 04:06:32.101 - Closing idle client

[13888] 22 Jan 04:06:50.065 * MASTER <-> SLAVE sync: Loading DB in memory

[13888] 22 Jan 04:10:15.889 * MASTER <-> SLAVE sync: Finished with success

[13888] 22 Jan 04:10:16.690 * Connecting to MASTER 10.80.129.220:6379

[13888] 22 Jan 04:10:16.690 * MASTER <-> SLAVE sync started

[13888] 22 Jan 04:10:16.690 * Non blocking connect for SYNC fired the event.

[13888] 22 Jan 04:10:16.690 * Master replied to PING, replication can continue...

```

Master:

```

[6027] 22 Jan 03:49:12.529 * Slave ask for synchronization

[6027] 22 Jan 03:49:12.529 * Starting BGSAVE for SYNC

[6027] 22 Jan 03:49:12.559 * Background saving started by pid 28433

[28433] 22 Jan 03:50:18.328 * DB saved on disk

[28433] 22 Jan 03:50:18.346 * RDB: 3645 MB of memory used by copy-on-write

[6027] 22 Jan 03:50:18.474 * Background saving terminated with success

[6027] 22 Jan 03:51:03.550 * Synchronization with slave succeeded

[6027] 22 Jan 03:52:39.088 # Client addr=10.xx.xx.218:38275 fd=77 name= age=207 idle=96 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4137 omem=99300968 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.

[6027] 22 Jan 03:54:29.819 * Slave ask for synchronization

[6027] 22 Jan 03:54:29.819 * Starting BGSAVE for SYNC

[6027] 22 Jan 03:54:29.849 * Background saving started by pid 28670

[28670] 22 Jan 03:55:36.792 * DB saved on disk

[28670] 22 Jan 03:55:36.811 * RDB: 3679 MB of memory used by copy-on-write

[6027] 22 Jan 03:55:36.900 * Background saving terminated with success

[6027] 22 Jan 03:56:21.316 * Synchronization with slave succeeded

[6027] 22 Jan 03:57:56.056 # Client addr=10.xx.xx.218:38313 fd=113 name= age=207 idle=95 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4095 omem=98324440 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.

[6027] 22 Jan 03:59:49.577 * Slave ask for synchronization

[6027] 22 Jan 03:59:49.577 * Starting BGSAVE for SYNC

[6027] 22 Jan 03:59:49.607 * Background saving started by pid 28933

[28933] 22 Jan 04:00:56.002 * DB saved on disk

[28933] 22 Jan 04:00:56.021 * RDB: 3470 MB of memory used by copy-on-write

[6027] 22 Jan 04:00:56.119 * Background saving terminated with success

[6027] 22 Jan 04:01:36.052 * Synchronization with slave succeeded

[6027] 22 Jan 04:03:19.029 # Client addr=10.xx.xx.218:38346 fd=59 name= age=210 idle=103 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4098 omem=98041936 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.

[6027] 22 Jan 04:05:02.551 * Slave ask for synchronization

[6027] 22 Jan 04:05:02.551 * Starting BGSAVE for SYNC

[6027] 22 Jan 04:05:02.584 * Background saving started by pid 29120

[29120] 22 Jan 04:06:09.513 * DB saved on disk

[29120] 22 Jan 04:06:09.532 * RDB: 3659 MB of memory used by copy-on-write

[6027] 22 Jan 04:06:09.640 * Background saving terminated with success

[6027] 22 Jan 04:06:49.673 * Synchronization with slave succeeded

[6027] 22 Jan 04:08:31.083 # Client addr=10.xx.xx.218:38380 fd=211 name= age=209 idle=102 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4095 omem=98246616 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.

[6027] 22 Jan 04:10:16.688 * Slave ask for synchronization

[6027] 22 Jan 04:10:16.688 * Starting BGSAVE for SYNC

[6027] 22 Jan 04:10:16.719 * Background saving started by pid 29323

[29323] 22 Jan 04:11:23.748 * DB saved on disk

[29323] 22 Jan 04:11:23.767 * RDB: 3631 MB of memory used by copy-on-write

[6027] 22 Jan 04:11:23.850 * Background saving terminated with success

[6027] 22 Jan 04:12:03.681 * Synchronization with slave succeeded

[6027] 22 Jan 04:13:46.009 # Client addr=10.xx.xx.218:38413 fd=499 name= age=210 idle=103 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=0 oll=4102 omem=98005232 events=rw cmd=sync scheduled to be closed ASAP for overcoming of output buffer limits.

```

INFO on slave keep saying:

```

# Replication

role:slave

master_host:10.80.129.220

master_port:6379

master_link_status:down

master_last_io_seconds_ago:-1

master_sync_in_progress:1

master_sync_left_bytes:714628316

master_sync_last_io_seconds_ago:0

master_link_down_since_seconds:100

slave_priority:100

slave_read_only:1

connected_slaves:0

```

or sometimes INFO stuck for a minute and produce the following:

```

# Replication

role:slave

master_host:10.80.129.220

master_port:6379

master_link_status:down

master_last_io_seconds_ago:-1

master_sync_in_progress:1

master_sync_left_bytes:0

master_sync_last_io_seconds_ago:0

master_link_down_since_seconds:108

slave_priority:100

slave_read_only:1

connected_slaves:0

```

By running INFO on slave multiple times I found out that amount of keys is going from 0 to real master's amount and than it starts again from 0.

Is it related to Master's message "Client ... scheduled to be closed ASAP for overcoming of output buffer limits"? According to Slave's log - things are simply perfect, but replica is never up...

Also, running INFO on master produces different message from time to time:

```

# Replication

role:master

connected_slaves:0

```

# Replication

role:master

connected_slaves:1

slave0:10.80.129.218,6379,send_bulk

```

Some info about master:

```

# Memory

used_memory:8608876648

used_memory_human:8.02G

used_memory_rss:8940982272

used_memory_peak:8710743400

used_memory_peak_human:8.11G

used_memory_lua:31744

mem_fragmentation_ratio:1.04

mem_allocator:jemalloc-3.2.0

```

Slave had default SAVE settings, but I disabled them during my tests. Both servers have the same SAVE configuration:

CONFIG GET save

1) "save"

2) ""

Any ideas?

Thanks in advance,

Alex

Josiah Carlson

unread,

Jan 22, 2015, 5:51:04 PM1/22/15

to redi...@googlegroups.com

While loading the RDB from the master, your master attempts to write in total about 100 megs worth of backlogged writes to the slave over 1.5 minutes. Because your slave isn't reading that data, being blocked loading the just-transferred snapshot for 2.5 minutes, your master thinks that it isn't able to keep up, so it closes the connection. Which forces your slave to reconnect. Lather, rinse, repeat.

You should look at your settings for "client-output-buffer-limit slave". If you're seeing disconnects after ~90 seconds and 100 megs, maybe set the soft limit to be 200 megs and 180 seconds, hard limit at 400 megs.

- Josiah

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To unsubscribe from this group and stop receiving emails from it, send an email to redis-db+u...@googlegroups.com.
To post to this group, send email to redi...@googlegroups.com.
Visit this group at http://groups.google.com/group/redis-db.
For more options, visit https://groups.google.com/d/optout.

Alex

unread,

Jan 23, 2015, 3:37:51 AM1/23/15

to redi...@googlegroups.com

Josiah,

Yep, you are right. I actually found the solution faster than my post was approved to appear here :) Increasing buffers on master to 512Mb solved the issue.

Well, there is still an issue - logs aren't really informative. I would expect some message on slave saying about reject. And a message on master about full buffer and the fact of disconnect. Currently there is only 1 message saying "Client ... scheduled to be closed ASAP for overcoming of output buffer limits".

Here are nice posts about mentioned issue: http://java.dzone.com/articles/top-redis-headaches-devops and https://redislabs.com/blog/the-endless-redis-replication-loop-what-why-and-how-to-solve-it

Maybe someone will find it useful.

Alex

Reply all

Reply to author

Forward

0 new messages