Re: client-output-buffer-limit problems with slaving


Josiah Carlson

Sep 14, 2012, 11:31:47 AM
to redi...@googlegroups.com
Your slave configuration is:
slave 268435456 67108864 60

The leftmost number is the hard limit: if your slave's output buffer
exceeds it (here 256MB), the slave is disconnected immediately.
The middle number is the soft limit (here 64MB): if the buffer stays
above it for longer than the third number of seconds (here 60), the
slave is disconnected.

In this case, it would seem that after the initial sync started, the
buffer couldn't drop back below your soft limit within 60 seconds. You
could increase your soft limit timeout, increase your soft limit, or
both.
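These limits can also be changed at runtime with CONFIG SET. A minimal sketch (the port matches the master in this thread; the raised soft limit and timeout are illustrative values, not a recommendation):

```shell
# Illustrative values: keep the 256 MB hard limit, raise the soft limit to
# 128 MB and the soft timeout to 120 s (limits are given in bytes).
HARD=268435456; SOFT=134217728; SECS=120
echo "hard: $((HARD / 1048576)) MB, soft: $((SOFT / 1048576)) MB, timeout: ${SECS}s"

# Apply at runtime; guarded so this sketch is safe to run without a server.
command -v redis-cli >/dev/null &&
  redis-cli -p 6380 config set client-output-buffer-limit \
    "slave $HARD $SOFT $SECS" 2>/dev/null || true
```

A change made this way is not persisted to redis.conf, so mirror it there if it should survive a restart.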

Regards,
- Josiah

On Thu, Sep 13, 2012 at 3:47 PM, tudor <t.a...@gmail.com> wrote:
> Hi all- I'm running into a slaving problem where the master will shut down a
> slave's output buffer before the client-output-buffer-limit is reached.
> Here's the output from the master's log:
>
> [9470] 13 Sep 22:24:04.789 * Slave ask for synchronization
> [9470] 13 Sep 22:24:04.789 * Starting BGSAVE for SYNC
> [9470] 13 Sep 22:24:09.454 * Background saving started by pid 18435
> [9470] 13 Sep 22:26:37.105 # Client addr=10.108.61.163:44422 fd=157 age=153
> idle=153 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=8361
> oll=3975 omem=100347928 events=r cmd=sync scheduled to be closed ASAP for
> overcoming of output buffer limits.
> [18435] 13 Sep 22:29:06.199 * DB saved on disk
> [9470] 13 Sep 22:29:07.138 * Background saving terminated with success
>
> The slave's view:
> [23037] 13 Sep 21:42:07.799 * The server is now ready to accept connections
> at /tmp/redis.sock
> [23037] 13 Sep 21:42:07.799 * Connecting to MASTER...
> [23037] 13 Sep 21:42:07.799 * MASTER <-> SLAVE sync started
> [23037] 13 Sep 21:42:07.800 * Non blocking connect for SYNC fired the event.
> [23037] 13 Sep 21:45:43.167 # Timeout receiving bulk data from MASTER...
> [23037] 13 Sep 21:45:43.167 * Connecting to MASTER...
> etc.
>
> And the configuration on master:
> redis-cli -p 6380 config get client-output-buffer-limit
> 1) "client-output-buffer-limit"
> 2) "normal 0 0 0 slave 268435456 67108864 60 pubsub 33554432 8388608 60"
>
> So there was only 100mb in the output buffer, but it was shut down even
> though the limit was 268mb. Any ideas what might be going on? Apologies if
> this is a repost.
>
> Tudor
>
> --
> You received this message because you are subscribed to the Google Groups
> "Redis DB" group.
> To view this discussion on the web visit
> https://groups.google.com/d/msg/redis-db/-/IE0hzEVHMSAJ.
> To post to this group, send email to redi...@googlegroups.com.
> To unsubscribe from this group, send email to
> redis-db+u...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/redis-db?hl=en.

tudor

Sep 18, 2012, 4:23:05 PM
to redi...@googlegroups.com
That makes sense- thanks for the quick reply. Will try it.

Adi Chiru

May 29, 2014, 9:37:11 PM
to redi...@googlegroups.com
Hello,

Can someone please share the implications on the master of increasing the limits or having no limits for the slave, at least at initial sync?

I have a 6GB Redis server which is not just big but also pretty busy. Trying to attach a slave gets into:

"[3355] 29 May 21:29:34.073 # Client addr=10.179.58.228:46269 fd=411 name= age=167 idle=167 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16382 oll=2914 omem=153832192 events=r cmd=psync scheduled to be closed ASAP for overcoming of output buffer limits."

I can disable the limits (0 0 0) or increase them, but I do not want to have a big impact on the master - is this possible?

Thanks,
Adi

Josiah Carlson

May 30, 2014, 12:58:48 AM
to redi...@googlegroups.com
Well, there are two problems with your question regarding the initial sync.

The first problem is that there is no built-in method to say, "allow slaves to have a larger backlog during initial sync, but give it a lower limit during other times".

The second problem is that slaves can be in situations where they may never catch up, so there may not be an amount of memory or time that would be sufficient for it to sync.


In your case, it seems as though the slave never actually catches up with the master. If your server really is that busy, you could be running into network bandwidth issues. If you are having bandwidth issues, I'd suggest trying to use an SSH tunnel with compression enabled (use the -C option).

Another option is to give it a larger limit until you hit the point where X% of syncs are able to complete successfully (I'd pick 99% as my target), or until you're uncomfortable with giving a client that much leeway.

There are also two other options in Redis that can help alleviate these sorts of situations during a slave reconnection/resync: repl-backlog-size and repl-backlog-ttl. These tell your system how large a backlog of commands to keep, and for how long, when slaves have intermittent connection issues. I usually set repl-backlog-size to about 1/2-3/4 the size of the on-disk snapshot, as I find it minimizes slave downtime for my uses.
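As a sketch of that sizing rule (the ~6 GB figure is the snapshot size discussed in this thread; the numbers are illustrative, not a recommendation):

```shell
# Sizing sketch: rule of thumb is 1/2 to 3/4 of the on-disk RDB snapshot.
RDB_BYTES=$((6 * 1024 * 1024 * 1024))   # ~6 GB RDB, as in this thread
BACKLOG=$((RDB_BYTES / 2))              # low end of the 1/2-3/4 range

# The resulting redis.conf lines (backlog in bytes, TTL in seconds):
echo "repl-backlog-size $BACKLOG"
echo "repl-backlog-ttl 3600"
```

The TTL only controls how long the backlog is retained after the last slave disconnects; a larger backlog costs that much extra RAM on the master.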

 - Josiah




Adi Chiru

May 30, 2014, 1:20:58 AM
to redis-db
Thanks for the fast reply Josiah,

When I was talking about disabling the limits (hard and soft) for the
slave only during the initial sync, I was referring to the possibility
of setting these limits to 0 via redis-cli at runtime, attaching the
slave, letting it sync, and then changing those limits back to more
conservative values. I believe it can be done, but I am not sure about
the impact on the master - that was my concern/question.

The 2 instances in my case are in 2 different availability zones in
AWS, but I don't think I have bandwidth issues.
So, to give more info on my problem/setup, this is what happens as
soon as I start the slave:

[3355] 29 May 21:20:34.240 * Slave asks for synchronization
[3355] 29 May 21:20:34.240 * Full resync requested by slave.
[3355] 29 May 21:20:34.240 * Waiting for next BGSAVE for SYNC
[4242] 29 May 21:21:09.248 * DB saved on disk
[4242] 29 May 21:21:10.737 * RDB: 1006 MB of memory used by copy-on-write
[3355] 29 May 21:21:12.879 * Background saving terminated with success
[3355] 29 May 21:21:18.850 * Background saving started by pid 4826
[3355] 29 May 21:22:56.027 # Client addr=10.179.58.228:45862 fd=208
name= age=142 idle=142 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0
qbuf-free=0 obl=16377 oll=4182 omem=234916512 events=r cmd=psync
scheduled to be closed ASAP for overcoming of output buffer limits.
[3355] 29 May 21:23:51.688 * Slave asks for synchronization
[3355] 29 May 21:23:51.688 * Full resync requested by slave.
[3355] 29 May 21:23:51.688 * Waiting for next BGSAVE for SYNC
[4826] 29 May 21:24:18.947 * DB saved on disk
[4826] 29 May 21:24:20.333 * RDB: 1212 MB of memory used by copy-on-write
[3355] 29 May 21:24:22.610 * Background saving terminated with success
[3355] 29 May 21:24:28.603 * Background saving started by pid 5378
[3355] 29 May 21:25:56.082 # Client addr=10.179.58.228:46076 fd=313
name= age=125 idle=125 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0
qbuf-free=0 obl=16383 oll=2664 omem=149808560 events=r cmd=psync
scheduled to be closed ASAP for overcoming of output buffer limits.
[3355] 29 May 21:26:47.944 * Slave asks for synchronization
[3355] 29 May 21:26:47.944 * Full resync requested by slave.
[3355] 29 May 21:26:47.944 * Waiting for next BGSAVE for SYNC
[5378] 29 May 21:27:28.177 * DB saved on disk
[5378] 29 May 21:27:29.686 * RDB: 986 MB of memory used by copy-on-write
[3355] 29 May 21:27:31.831 * Background saving terminated with success
[3355] 29 May 21:27:37.814 * Background saving started by pid 5934
........

During this time, on the slave:

[4640] 29 May 20:32:42.355 * Connecting to MASTER redis3.freelancer.com:6379
[4640] 29 May 20:32:42.358 * MASTER <-> SLAVE sync started
[4640] 29 May 20:32:42.360 * Non blocking connect for SYNC fired the event.
[4640] 29 May 20:32:42.362 * Master replied to PING, replication can continue...
[4640] 29 May 20:32:42.363 * Partial resynchronization not possible
(no cached master)
[4640] 29 May 20:32:42.365 * Full resync from master:
621480e9295872416266e563939b4fd6724eb5b7:8823523103
[8470] 29 May 20:34:35.711 * DB saved on disk
[8470] 29 May 20:34:36.062 * RDB: 1 MB of memory used by copy-on-write
[4640] 29 May 20:34:36.636 * Background saving terminated with success
[4640] 29 May 20:36:28.710 # Timeout receiving bulk data from
MASTER... If the problem persists try to set the 'repl-timeout'
parameter in redis.conf to a larger value.
[4640] 29 May 20:36:28.710 * Connecting to MASTER redis3.freelancer.com:6379
[4640] 29 May 20:36:28.788 * MASTER <-> SLAVE sync started
[4640] 29 May 20:36:28.790 * Non blocking connect for SYNC fired the event.
[4640] 29 May 20:36:28.791 * Master replied to PING, replication can continue...
[4640] 29 May 20:36:28.793 * Partial resynchronization not possible
(no cached master)
[4640] 29 May 20:36:28.795 * Full resync from master:
621480e9295872416266e563939b4fd6724eb5b7:10105649942
[4640] 29 May 20:39:25.067 # Timeout receiving bulk data from
MASTER... If the problem persists try to set the 'repl-timeout'
parameter in redis.conf to a larger value.
[4640] 29 May 20:39:25.068 * Connecting to MASTER redis3.freelancer.com:6379
[4640] 29 May 20:39:25.073 * MASTER <-> SLAVE sync started
[4640] 29 May 20:39:25.075 * Non blocking connect for SYNC fired the event.
[4640] 29 May 20:39:25.076 * Master replied to PING, replication can continue...
[4640] 29 May 20:39:25.080 * Partial resynchronization not possible
(no cached master)
[4640] 29 May 20:39:25.082 * Full resync from master:
621480e9295872416266e563939b4fd6724eb5b7:10355757565
[4640] 29 May 20:41:25.421 * Background saving started by pid 8474
[4640] 29 May 20:42:31.522 # Timeout receiving bulk data from
MASTER... If the problem persists try to set the 'repl-timeout'
parameter in redis.conf to a larger value.



So, it looks like my ~6GB dump on the master is not getting sent
across to the slave fast enough - it could be a bandwidth problem, but
considering it is 6GB... maybe I should just increase repl-timeout?
Can that be done? If yes, can you please tell me what the
implications are for the master?

Thanks,
Adi

Yiftach Shoolman

May 30, 2014, 2:02:36 AM
to redi...@googlegroups.com, redis-db
I would never set the slave output buffer to 'unlimited', as your master may reach OOM. If I were you, I would just increase both thresholds (and for simplicity make them the same value) until replication is completed, and leave it that way so that the next time a full SYNC happens there is a good chance you are safe.

Furthermore, please make sure repl-timeout is properly set on both master and slave.


Josiah Carlson

May 30, 2014, 2:05:29 AM
to redi...@googlegroups.com
Okay, reading the output from your logs, you've got a 6 gig snapshot (which suggests anywhere from 15-60 gigs in-memory). That snapshot takes 3 minutes to create, and uses roughly 1 gig of memory to perform the snapshot. To say that you are doing a lot of operations against Redis is a bit of an understatement.

I think this is strictly a time issue, and it is largely bandwidth-related. Think of it like this: 6 gigs at 100 Mbit is 8 minutes. 6 gigs at 300 Mbit (not unreasonable for cross AWS-AZ transfers) is a bit over 2 1/2 minutes. If you got a 50% reduction in data size with an SSH tunnel (or a 90%+ reduction like http://tech.3scale.net/2012/07/25/fun-with-redis-replication/), you could cut your times substantially.

Use an SSH tunnel with compression, and increase your time limits (but not your memory limits). Don't leave it unlimited, because the last thing you want is for a slave to end up killing your master through excess memory use, even if you are babysitting it. Maybe give it a gig of backlog if you want to be really flexible; just make sure you've got enough spare RAM on the box first.
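A minimal sketch of that tunnel setup (host name, user, and local port are placeholders; the commands are printed rather than executed, since the tunnel needs a real master):

```shell
MASTER_HOST="master.example.com"   # hypothetical; substitute your master
LOCAL_PORT=6280                    # local end of the tunnel on the slave box

# -C compresses the stream, -N opens no remote shell, -f backgrounds ssh.
TUNNEL_CMD="ssh -C -N -f -L ${LOCAL_PORT}:127.0.0.1:6379 deploy@${MASTER_HOST}"

# Point the slave at the local end of the tunnel instead of the master directly.
SLAVEOF_CMD="redis-cli slaveof 127.0.0.1 ${LOCAL_PORT}"

printf '%s\n%s\n' "$TUNNEL_CMD" "$SLAVEOF_CMD"
```

All replication traffic then flows through the compressed tunnel; RDB payloads typically compress well, so the initial sync transfers far fewer bytes.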

I have experienced variances in bandwidth between machines in different availability zones of up to around a factor of 4x, I have seen pairs of machines drop 5-10% of packets between them (but none to/from any other machines), I've had pairs of machines that refused to communicate with each other (but had no problems communicating with any other machine in the network), etc. And if you aren't using your slave for much of anything yet, consider replacing it to see if you have any better luck.

 - Josiah

Adi Chiru

May 30, 2014, 3:05:13 AM
to redis-db
Thanks guys,

I adopted pretty much all the recommendations from you two, including
the ssh compression idea, and so:

root@redis3:~# redis-cli config get * | grep -A2 repl
repl-ping-slave-period
10
repl-timeout
1800
repl-backlog-size
1048576000
repl-backlog-ttl
3600

I eventually also changed this from no to yes (although I know it may
increase the bandwidth requirements):
repl-disable-tcp-nodelay
yes

But the result is actually worse... by that I mean the master kicks
the slave off much faster.

Master:
[3355] 30 May 02:56:28.875 * Slave asks for synchronization
[3355] 30 May 02:56:28.875 * Full resync requested by slave.
[3355] 30 May 02:56:28.875 * Starting BGSAVE for SYNC
[3355] 30 May 02:56:35.376 * Background saving started by pid 16330
[3355] 30 May 02:56:43.733 # Client addr=127.0.0.1:49630 fd=185 name=
age=15 idle=15 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0
obl=16372 oll=2896 omem=268753696 events=r cmd=psync scheduled to be
closed ASAP for overcoming of output buffer limits.
[16330] 30 May 02:59:44.675 * DB saved on disk
[16330] 30 May 02:59:46.216 * RDB: 2397 MB of memory used by copy-on-write
[3355] 30 May 02:59:48.806 * Background saving terminated with success
...

Slave:
[19460] 30 May 02:56:27.975 * DB loaded from disk: 210.107 seconds
[19460] 30 May 02:56:27.976 * The server is now ready to accept
connections on port 6379
[19460] 30 May 02:56:28.869 * Connecting to MASTER localhost:6280
[19460] 30 May 02:56:28.869 * MASTER <-> SLAVE sync started
[19460] 30 May 02:56:28.870 * Non blocking connect for SYNC fired the event.
[19460] 30 May 02:56:28.873 * Master replied to PING, replication can
continue...
[19460] 30 May 02:56:28.875 * Partial resynchronization not possible
(no cached master)
[19460] 30 May 02:56:28.877 * Full resync from master:
621480e9295872416266e563939b4fd6724eb5b7:68385253408
...

So, 3 questions please:

1. Do I need to apply all of the same configuration changes on the slave too?

2. Can the "Partial resynchronization not possible (no cached master)"
be overcome somehow? I take full backups of the master via BGSAVE each
hour (yes, I stopped those while attempting the slaving), so I could
load the latest one manually into the slave and then attach the slave
to the master - the last time I tried that, however, the slave still
requested a full resync....

3. Any other ideas?

Thanks,
Adi


Yiftach Shoolman

May 30, 2014, 3:19:16 AM
to redi...@googlegroups.com
Adi,

Just use this: 

--> config set client-output-buffer-limit "slave <hard_limit in bytes> <soft_limit in bytes> <soft_seconds>"

And make soft_limit=hard_limit and soft_seconds=0, so you will only get the buffer-exceeded disconnect if you actually exceed the threshold.

I would start by setting these values to 1GiB, i.e. 1024x1024x1024 = 1,073,741,824 (assuming you have enough spare memory on the machine); otherwise start smaller.

You can monitor the 'omem' parameter of CLIENT LIST and, if you are getting close to the threshold you set, just increase it.
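For example, a small sketch of that check (the sample line is copied from the logs earlier in this thread; in practice, pipe `redis-cli client list` in instead):

```shell
# Extract omem (output-buffer bytes) for slave connections (flags=S).
LINE='addr=10.179.58.228:46269 fd=411 name= age=167 idle=167 flags=S db=0 sub=0 psub=0 multi=-1 qbuf=0 qbuf-free=0 obl=16382 oll=2914 omem=153832192 events=r cmd=psync'

OMEM=$(printf '%s\n' "$LINE" | grep 'flags=S' | sed -n 's/.*omem=\([0-9]*\).*/\1/p')
echo "slave output buffer: ${OMEM} bytes ($((OMEM / 1048576)) MB)"
```

A cron job comparing that value against, say, 80% of the configured threshold is enough to get an early warning before the disconnect fires.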


BTW - this should work regardless of the SSH/tunnel setting.

Yiftach Shoolman
+972-54-7634621

Adi Chiru

May 30, 2014, 3:50:43 AM
to redis-db
Hi Yiftach,

Tried it, with 1GB and then 2GB.
The limit is still reached eventually (in about 1 min for 1GB and 2
min for 2GB) and the master drops the connection.

I could go to 3GB maybe even 4GB but not higher - however, this will
not solve the problem as far as I can see....

So, to see if I understand this properly, is the following correct?
I need to force the master to feed the slave at a slower rate for a
longer period of time and not kick it off when the buffer is full, but
I suspect that the changes happening on the master are too fast for a
slower sync to be useful, as the slave will never catch up...

What are the alternatives in this case?

Thanks,
Adi

Yiftach Shoolman

May 30, 2014, 4:49:03 AM
to redi...@googlegroups.com
Hi Adi,

There are multiple things you can try to solve this problem (I must say it is not an easy one):
1. I would first disable the replication compression, i.e. go back to the standard replication, and see if it helps. In many cases bandwidth isn't an issue between multiple zones in the same AWS region, and compression can just slow down replication, meaning more data is written to the replication buffers

2. I would try to increase the replication buffer as much as I can, but pay attention and avoid reaching OOM. If you get close to that, just kill the replication child - you have the PID in the log file

3. Where did you configure Redis to save the RDB file created during the BGSAVE process of the replication? Does it go to EBS, or does it use the local ephemeral storage? It is better to write it to the ephemeral storage and to stripe your disks first. Doing so will make the BGSAVE process faster, i.e. less data accumulates in the replication buffer

4. We managed to solve similar situations using a technique mentioned here. If more help is needed let me know

5. Worst-case solution: launch a bigger master instance and manually copy your RDB file to it. This means you will either have some downtime or you are going to lose your latest 'writes'. But if this is the only alternative you have, then go for it

Adi Chiru

Jun 1, 2014, 8:07:36 PM
to redis-db
Thanks for the reply Yiftach,

1. I will retry this today

2. and will go up with the buffers.... maybe I am lucky :D

3. I use ephemeral disks, so I should have pretty stable and decent
performance, although I could probably improve it by setting up a RAID
array... that brings other complications though

4. I already read the post you mentioned, but it applied to Redis 2.6,
and the patch they did is not really a solution for me if I want no
downtime for the master. Was similar functionality maybe included in
2.8.2? If so, can you point me to more info on it?

5. That would be the final solution; we already took it into
consideration, but to go this route I need to make sure I can attach a
slave to it afterwards, because we really need HA and this is the
first step towards that.

Thanks for the input and for your time!

Adi




Adi Chiru

Jun 1, 2014, 9:07:23 PM
to redis-db
Hello,

I disabled the ssh tunnel stuff and started the slave instance with
the same buffers as in my previous attempts. This time everything
worked perfectly - it took about 2 minutes to move the dump from
master to slave, and 197 seconds to load it in the slave instance,
after which everything looks perfect.

The main reason I believe it worked this time is that I did it during
the lowest-usage period possible for this Redis server; I am going to
keep an eye on this replication for about a week to make sure the
slave can keep up with the master during the highest load.

I will also set up monitoring, so I have this question:

In this line:
slave0:ip=10.179.58.228,port=6379,state=online,offset=135643669285,lag=0

is the lag value a good indication of how up to date the slave is?
If I understand correctly what I read, this is the number of seconds
the slave is behind the master - is that correct?

Now I am moving on to an even bigger Redis server :D but hopefully not as busy a one

Thanks,
Adi




Adi Chiru

Jun 2, 2014, 12:48:53 AM
to redis-db
Hello again everyone,

The last issue I had with initial replication was solved for that
server just by attempting the initial sync during the low-usage times
with increased buffers.

Another redis server however, one even busier has this problem:

# Replication
role:master
connected_slaves:1
slave0:ip=10.144.178.126,port=6379,state=online,offset=83052571679,lag=1
master_repl_offset:83809730520
repl_backlog_active:18
repl_backlog_size:1048576000
repl_backlog_first_byte_offset:82761154521
repl_backlog_histlen:1048576000

So initial sync succeeded but now it looks like the slave is falling
behind a lot.
Any way to fix this?

Also, I kept looking everywhere for what each of those lines means
exactly but could not find an answer. So, what is repl_backlog_active?
In this case it is 18 - 18 what, and what does it mean? Should it be
1, or close to 1, rather than a big number?

Thanks,
Adi



Yiftach Shoolman

Jun 2, 2014, 5:56:10 AM
to redi...@googlegroups.com
Hi Adi,

It seems you took the right decision and did everything during the non-peak hours.

As for the 'lag' question AFAIR you are correct.

Best

Adi Chiru

Jun 2, 2014, 10:06:19 PM
to redis-db
Thanks for the help Yiftach,

However, from this output, it looks like the lag is not related to
how far behind the slave is:

# Replication
role:master
connected_slaves:1
slave0:ip=10.144.178.126,port=6379,state=online,offset=83052571679,lag=1
master_repl_offset:83809730520
repl_backlog_active:18
repl_backlog_size:1048576000
repl_backlog_first_byte_offset:82761154521
repl_backlog_histlen:1048576000

So, as you can see, although the slave offset and the
master_repl_offset are not that close, the lag is 1...

So, basically, after reading the docs many times and going through
some search results in this mailing list, I still have no idea how to
properly see how well the slave is doing relative to the master.

How do you monitor that? (What should a script look for in order to
send an alert if the slave is falling too far behind?)

Thanks,
Adi

