feature request discussion: non-restarting replication

Jeremy Zawodny

Feb 22, 2012, 5:11:28 PM

to redi...@googlegroups.com

We encountered an issue recently after re-sizing our redis clusters a bit and it led to a wish for a simple feature. I wanted to present the case for it here to see if anyone else had ideas to add before putting it on github or even trying to just code it up and submit a pull request.

We have redis clusters in two data centers: A and B. Both clusters contain 10 machines and they all run 4 instances of redis-server. One data center is "active" and the other is "standby". The machines are "paired" across data centers, so redis1 in data center B is slaving from redis1 in data center A.

The problem is that occasionally the WAN link between the data centers is interrupted and the slaves in data center B decide they need to re-sync with their masters. Unfortunately, *all* the instances on each slave try to do this AT THE SAME TIME and that causes too much stress on the masters. Effectively, the masters are DoS'd by the slaves all re-syncing at the same time.

We've worked around this by reducing the max memory size of the instances, but we'd really like to make more RAM available to redis and have a more controlled way of doing the re-sync.

We already have a process in place to run periodically on the redis slaves and ensure that they're replicating properly. If there's a problem, it re-starts replication ONE INSTANCE AT A TIME and makes sure everything is running well.

So what I'd like is a config directive in redis that says "if you're a slave and you lose contact with the master, do not re-sync." The idea is that I'd set this to true (it'd be false by default) and then my existing script would handle those occasional times when slaves get disconnected.

Looking at the redis code, this should be fairly straightforward.

Comments or objections?

Thanks,

Jeremy

Scott Smith

Feb 22, 2012, 5:18:46 PM

to redi...@googlegroups.com

+1 on that.

We experience the same problem if a host restarts. Is there a solution that would solve for both scenarios?

--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.

Josiah Carlson

Feb 22, 2012, 5:25:56 PM

to redi...@googlegroups.com

Scott: If you don't want replication when Redis first starts up,
disable replication in the configuration file, then enable it once it
is started via "SLAVEOF host port".
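
A rough sketch of that second step, for anyone scripting it: the instance starts with no slaveof line in its config, and a small helper sends SLAVEOF afterwards. The helper names (encode_command, start_replication) are made up for illustration; the only Redis-specific parts are the SLAVEOF command itself and the multi-bulk request format.

```python
import socket


def encode_command(*args):
    """Encode a command in the Redis multi-bulk request format."""
    parts = [b"*%d\r\n" % len(args)]
    for arg in args:
        data = str(arg).encode()
        parts.append(b"$%d\r\n%s\r\n" % (len(data), data))
    return b"".join(parts)


def start_replication(slave_host, slave_port, master_host, master_port, timeout=5.0):
    """Tell an already-running instance to start slaving from the given master."""
    with socket.create_connection((slave_host, slave_port), timeout=timeout) as conn:
        conn.sendall(encode_command("SLAVEOF", master_host, master_port))
        # A status reply starting with "+OK" is expected on success.
        return conn.recv(1024)
```

Calling start_replication for one instance, waiting for it to finish syncing, and only then moving to the next is what keeps the master from being hit by every slave at boot.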

- Josiah

Jeremy Zawodny

Feb 22, 2012, 5:27:09 PM

to redi...@googlegroups.com

Yes, that's what we do as well. That fixes the "at boot" DoS but not the random network fail case.

Jeremy

Josiah Carlson

Feb 22, 2012, 5:30:20 PM

to redi...@googlegroups.com

I can see the use of this, but I can't help but think that this is one
of those things that maybe should be a special command instead of a
configuration option. The command would be something like "SLAVEOF
host port ONCE", which says that it will slave to that master until
the link goes down, then it won't reconnect.

Why not a config file option? Because configuration files are the
kinds of things that you set and never look at again, then 6 months
down the line someone is digging through it and asking "wtf did we do
that for?" I could get behind "SLAVEOF host port ONCE" if it was only
available via a remote command, and not with a configuration option.

Regards,
- Josiah

Jan Oberst

Feb 22, 2012, 7:06:04 PM

to redi...@googlegroups.com

I agree with Josiah here.

Our config also has slaves start as masters. After boot we set SLAVEOF one machine after the next, like Jeremy mentioned.

We use a central management script that keeps track of all our redis machines. I think SLAVEOF ... ONCE would work well, because all we'd have to change is the central management tool.

I would add another INFO flag that states "slave_out_of_sync" or something similar. We're reading the INFO every minute anyways, so if a slave is out of sync we could just schedule it for another SLAVEOF .... ONCE call, which would effectively re-sync the machine.

Jay A. Kreibich

Feb 22, 2012, 9:46:49 PM

to redi...@googlegroups.com

On Wed, Feb 22, 2012 at 02:18:46PM -0800, Scott Smith scratched on the wall:

> +1 on that.
>
> We experience the same problem if a host restarts. Is there a solution
> that would solve for both scenarios?

I might suggest the ability to configure a global lock file that is
shared by all Redis instances on a single physical server. The lock
could be used to block and/or delay [BG]SAVEs and/or SLAVEOF commands.
This would allow a set of instances to ensure only a single instance
is attempting to save and/or sync at any given moment, reducing
contention for these high-resource commands.

Ideally, you could configure different lock files for [BG]SAVE and
SLAVEOF commands, although they might point to the same file.

The lock file could be a simple PID file. If the file exists (and
the process exists), the system is locked. If no file exists, a
process can grab the lock by simply writing out the file. Race
conditions can be avoided with the proper flags to open(2).
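
To make the PID-file idea concrete, here is a minimal sketch assuming a POSIX host. The function names and the stale-lock handling are illustrative assumptions, not something Redis provides; the atomic part is the O_CREAT | O_EXCL pair passed to open(2).

```python
import os


def holder_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def try_acquire(lock_path):
    """Try once to take the lock; return True if this process now owns it."""
    for _ in range(2):  # the second pass only runs after a stale file was removed
        try:
            # O_CREAT | O_EXCL makes creation atomic: only one process can succeed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        except FileExistsError:
            try:
                with open(lock_path) as f:
                    pid = int(f.read().strip())
            except (FileNotFoundError, ValueError):
                return False  # just released, or the holder has not written its PID yet
            if holder_is_alive(pid):
                return False
            # Stale file left by a dead process. There is a small race here: two
            # waiters may both decide the file is stale. A production version
            # would need something stricter (for example flock).
            try:
                os.unlink(lock_path)
            except FileNotFoundError:
                pass
            continue
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        return True
    return False


def release(lock_path):
    """Give the lock back by removing the PID file."""
    os.unlink(lock_path)
```

The "process exists" check is what lets a crashed holder be cleaned up, and it is also the part that is hardest to get fully race-free with plain files.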

In the case of SAVE, I would have the command immediately return an
error if the lock cannot be acquired. For SLAVEOF, the command would
simply go idle until the lock can be acquired. BGSAVE might go idle
or might return... I'm not sure which makes more sense.

In the case of BGSAVE and SLAVEOF, the fork() would not be allowed until
the instance owns the lock. The lock would then be released as soon as
the child process exits (or, in the case of a SLAVEOF, when the
initial bulk transfer is complete). You might also be able to
configure a time-out, so that SLAVEOF returns an error after 300
seconds or something. Any time a command is outstanding, the system
would check for the lock every 250ms or some other configurable value.
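
The wait-with-timeout behaviour could look roughly like the sketch below. The 300 second timeout and 250ms poll interval are the figures from the paragraph above; wait_for_lock and its parameters are hypothetical names, and try_acquire stands for whatever non-blocking lock attempt is in use.

```python
import time


def wait_for_lock(try_acquire, timeout=300.0, poll_interval=0.25,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll try_acquire() until it succeeds or the timeout expires.

    Returns True once the lock is held, False if the deadline passed first.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        if try_acquire():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

A SLAVEOF handler built on this would go idle inside the loop and report an error to the caller when it returns False.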

If we really want to get fancy, we could allow a set of lock files,
say .../redis-lock-[1-4].pid, to allow up to four operations at one
time. This might be useful for very large servers with, for
example, a dozen instances. The lock files could still be used to
limit "overhead" resource usage, but would allow more than one
high-usage operation at a time.

Thoughts?

-j

--
Jay A. Kreibich < J A Y @ K R E I B I.C H >

"Intelligence is like underwear: it is important that you have it,
but showing it to the wrong people has the tendency to make them
feel uncomfortable." -- Angela Johnson

Greg Andrews

Feb 23, 2012, 2:33:45 AM

to redi...@googlegroups.com

I seem to recall a discussion 6-9 months ago about this same situation. The thread centered around creating a config limit on the number of simultaneous slave SYNC commands a master will allow. I thought there was some progress made toward creating a patch and getting it included in a release?

-Greg

catwell

Feb 23, 2012, 4:01:23 AM

to Redis DB

On Feb 22, 11:30 pm, Josiah Carlson <josiah.carl...@gmail.com> wrote:

> Why not a config file option? Because configuration files are the
> kinds of things that you set and never look at again, then 6 months
> down the line someone is digging through it and asking "wtf did we do
> that for?"

On the other hand you can add comments to configuration files to
explain your choices, and you can version them. And you can even
put them in cfengine / puppet / chef if you want.

My favorite kind of configuration system for critical infrastructure
(which Redis has become) is something similar to Cisco's IOS, where
the configuration is dynamic but you can dump it to a file and copy
it to another instance.

Also, now that we have Lua in Redis, why not use it as a configuration
language? After all, it's pretty good at that.

Dvir Volk

Feb 23, 2012, 4:35:55 AM

to redi...@googlegroups.com

yeah, this one was initiated by my pains with this situation. it was then my impression that my scenario of many slaves for one master was rare and this wasn't a priority. maybe things have changed since?

--

Dvir Volk

System Architect, The Everything Project (formerly DoAT)

http://everything.me

Colin Vipurs

Feb 23, 2012, 4:48:09 AM

to redi...@googlegroups.com

+1 as well. It seems that this could be a useful feature for doing a
one-time failover from master to slave

--
Maybe she awoke to see the roommate's boyfriend swinging from the
chandelier wearing a boar's head.

Something which you, I, and everyone else would call "Tuesday", of course.

Pedro Melo

Feb 23, 2012, 6:18:25 AM

to redi...@googlegroups.com

Hi,

On Wed, Feb 22, 2012 at 10:11 PM, Jeremy Zawodny <Jer...@zawodny.com> wrote:
> The problem is that occasionally the WAN link between the data centers is
> interrupted and the slaves in data center B decide they need to re-sync with
> their masters. Unfortunately, *all* the instances on each slave try to do
> this AT THE SAME TIME and that causes too much stress on the masters.
> Effectively, the masters are DoS'd by the slaves all re-syncing at the same
> time.

I understand the problem this is causing your system but I believe the
solution you are presenting is targeting the symptom and not the root
cause of the problem.

There is one event here that triggers a chain of two problems/symptoms:

* Event: connectivity loss between master and slave;
* Problem 1: slave needs full re-sync with master;
* Problem 2: N slaves doing this at the same time will cause a DoS on masters.

The solutions presented on this thread try to tackle Problem 2, how to
prevent the DoS of the master, and although it is a valid problem and
should be solved (I'm particularly fond of SLAVEOF host port ONCE
myself), it doesn't fix the initial problem: the need for a full
re-sync.

I would propose that, for each slave, a rotating AOF file should be
kept, based on time or size, with older files being removed when
slaves ACK back synchronization points reached.

For example, when a slave connects, it tells you what was the last
sync point it saw, and the master only has to send the AOF's since
that sync point. Every time the master rotates a slave AOF, it sends
the new name to the SLAVE. Every time a slave ACKs a specific AOF sync
point, all AOFs up-to that one can be removed (or archived, if your
business rules require that).
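
As a toy model of that bookkeeping (in memory, with lists standing in for the rotated AOF files), something like the following captures the rotate / ACK / catch-up cycle. The SlaveLog class and all of its method names are invented for this sketch, and sync points are simply segment numbers.

```python
class SlaveLog:
    """Per-slave write log split into numbered segments (one per rotated file)."""

    def __init__(self, max_segment_bytes=1024):
        self.max_segment_bytes = max_segment_bytes
        self.current = 0            # sync point of the segment being written
        self.segments = {0: []}     # sync point -> commands in that segment
        self._current_size = 0

    def append(self, command):
        """Record a write, rotating first if the current segment would overflow."""
        too_big = self._current_size + len(command) > self.max_segment_bytes
        if too_big and self.segments[self.current]:
            self.rotate()
        self.segments[self.current].append(command)
        self._current_size += len(command)

    def rotate(self):
        """Start a new segment; the master would announce this sync point to the slave."""
        self.current += 1
        self.segments[self.current] = []
        self._current_size = 0
        return self.current

    def ack(self, sync_point):
        """The slave has applied everything up to sync_point: drop those segments."""
        for point in [p for p in self.segments if p <= sync_point and p != self.current]:
            del self.segments[point]

    def since(self, last_seen):
        """Commands a reconnecting slave still needs, or None if a full resync is required."""
        if last_seen + 1 < min(self.segments):
            return None  # the segments it needs were already discarded
        needed = []
        for point in sorted(self.segments):
            if point > last_seen:
                needed.extend(self.segments[point])
        return needed
```

A slave that reconnects with a sync point the master still holds gets only the missing commands; anything older falls back to the full dump.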

I'm sure that this simplistic approach has holes in it, I haven't
thought it through thoroughly yet, but my initial point still stands: you
are fixing a symptom, not the cause. It might be enough, and that's
fine, just pointing it out though :).

Best regards,
--
Pedro Melo
@pedromelo
http://www.simplicidade.org/
http://about.me/melo
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org

Salvatore Sanfilippo

Feb 23, 2012, 7:28:40 AM

to redi...@googlegroups.com

Hello Pedro,

I agree with your analysis. Jeremy's proposal, while it could actually be
useful to mitigate the problem, does not fix the root cause.
I also agree about incremental resync as a solution to many of these issues.

However I think the implementation of incremental resync should use
the implementation proposed here:

https://github.com/antirez/redis/issues/189

In short it uses the trick of still accumulating the output buffer of
the slaves for some time (or for some space) while the slave is not
connected. Moreover there is a sliding window so that we don't discard
the buffer sent to the slaves but take it for some time, since a slave
may want to resync from an offset that is already flushed on the
socket.
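
The sliding-window idea can be sketched as a bounded buffer indexed by the byte offset in the replication stream. This is only an illustration of the concept described above, not the design in the linked issue; the class and method names are assumptions.

```python
class ReplicationBacklog:
    """Keep the most recent `capacity` bytes of the replication stream."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = bytearray()
        self.end_offset = 0  # total bytes ever fed into the stream

    @property
    def start_offset(self):
        """Stream offset of the oldest byte still held in the window."""
        return self.end_offset - len(self.buffer)

    def feed(self, data):
        """Append newly propagated bytes and slide the window forward if needed."""
        self.buffer += data
        self.end_offset += len(data)
        overflow = len(self.buffer) - self.capacity
        if overflow > 0:
            del self.buffer[:overflow]

    def read_from(self, offset):
        """Bytes from `offset` onward, or None if that offset left the window."""
        if offset < self.start_offset or offset > self.end_offset:
            return None  # the slave needs a full resync
        return bytes(self.buffer[offset - self.start_offset:])
```

A reconnecting slave would present the last offset it processed; a None result means the outage outlasted the window and the full dump is still needed.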

But back to the root cause for a moment, the problem is: "currently
Redis does not handle well the case when multiple slaves want sync at
once". I trust you about that, but I would like to understand why this
happens.

I mean, even without partial resync, Redis should handle that better.
Full resync should just be slower, but not a DoS.
Redis is already optimized to do a single BGSAVE on reconnection of
multiple slaves, so what is actually DoSing it? Maybe the multiple
bulk transfers generate too much I/O and we should throttle this stuff?

Please if you have some information on this matter and how I can
reproduce it I would love to insert this fix into 2.6 if possible.

Salvatore

--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele

Pedro Melo

Feb 23, 2012, 9:47:18 AM

to redi...@googlegroups.com

Hi,

On Thu, Feb 23, 2012 at 12:28 PM, Salvatore Sanfilippo
<ant...@gmail.com> wrote:
> However I think the implementation of incremental resync should use
> the implementation proposed here:
>
> https://github.com/antirez/redis/issues/189

I knew I'd read something about partial resync, but forgot to search
the issues :)

I like it, although it would only survive small downtimes (which will
probably cover most of the situations, so no worries there), and I
really like the use of bytes written/read as a sync marker. Simple and
effective.

> In short it uses the trick of still accumulating the output buffer of
> the slaves for some time (or for some space) while the slave is not
> connected. Moreover there is a sliding window so that we don't discard
> the buffer sent to the slaves but take it for some time, since a slave
> may want to resync from an offset that is already flushed on the
> socket.

I don't know if you can use the same buffer for all clients, unless
you send the current byte count after the full dump.

> I mean, even without partial resync, Redis should handle that better.
> Full resync should just be slower, but not a DoS.
> Redis is already optimized to do a single BGSAVE on reconnection of
> multiple slaves, so what is actually DoSing it? Maybe the multiple
> bulk transfers generate too much I/O and we should throttle this stuff?
>
> Please if you have some information on this matter and how I can
> reproduce it I would love to insert this fix into 2.6 if possible.

I assume these last two paragraphs are for Jeremy, since he is the one
with the problem.

Bye,

Jeremy Zawodny

Mar 14, 2012, 2:35:44 PM

to redi...@googlegroups.com

Sorry for the delay on getting back to this issue... Here's what has happened to us a few times (with a bit more detail).

We have 10 hosts in two data centers (a and b). Let's call the hosts host1a, host1b, host2a, host2b, etc.

Every host runs 4 instances of redis-server and has 32GB of RAM. All "b" hosts replicate from "a" hosts, so:

host1b:63790 is a slave of host1a:63790

host1b:63791 is a slave of host1a:63791

host1b:63792 is a slave of host1a:63792

host1b:63793 is a slave of host1a:63793

And so on with the other 9 pairs.

Each redis-server was configured with:

maxmemory 7gb

maxmemory-policy volatile-lru

maxmemory-samples 20

And there is no persistence aside from the .rdb files that are created at (1) shutdown or (2) during replication sync.

This is an important point: our instances are almost always "full" and we're relying on the lru to evict data continuously.

So, what happens is this:

(1) the network between the "a" and "b" hosts becomes interrupted

(2) the slaves in "b" lose contact with "a" and eventually timeout

(3) the slaves in "b" decide to re-sync -- ALL AT ONCE

(4) the redis instances in "a" each start to dump their .rdb files

(5) since there are several going at once, the dumping to disk is i/o bound

(6) the dumping takes longer than it should, which results in more dirty COW pages

(7) the fact that we're always full and evicting keys makes #6 worse

(8) the box starts to swap, which makes #7 worse

(9) we enter a death spiral which is hard to recover from
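
A back-of-the-envelope calculation shows how little room steps (6) through (8) leave. It uses the figures above (32GB hosts, 4 instances, maxmemory 7gb) and assumes every instance is full, ignoring the OS, allocator overhead and client buffers, so the real margin is smaller. The function name is made up for the example.

```python
def max_dirty_fraction(host_ram_gb, instances, maxmemory_gb, concurrent_dumps):
    """Fraction of each dumping instance's pages that can be copied before RAM runs out.

    While a dump child is alive, every page the parent modifies is duplicated
    (copy-on-write), so in the worst case a dump costs a second copy of the dataset.
    """
    headroom_gb = host_ram_gb - instances * maxmemory_gb
    return min(1.0, headroom_gb / (concurrent_dumps * maxmemory_gb))


# All four instances dumping at once: about 14% of pages may change before swapping.
print(round(max_dirty_fraction(32, 4, 7, concurrent_dumps=4), 3))  # 0.143
# One instance dumping at a time: about 57% of its pages may change.
print(round(max_dirty_fraction(32, 4, 7, concurrent_dumps=1), 3))  # 0.571
```

With constant eviction churning pages and slow, contended disk I/O stretching the dump time, the first number is easy to exceed, which matches the swap behaviour described above.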

However, if we were to re-sync one instance at a time (we already have external code for this, as I mentioned), this problem doesn't occur and our instances resync pretty quickly.

The only other solution, which sucks, is to really lower the maxmemory on our instances quite a bit but that's a wasteful solution in my eyes.

Does this help clarify what we're seeing and why I believe my proposed fix (a non-restarting replication option) would help to prevent it?

Thanks,

Jeremy

Jeremy Zawodny

Mar 14, 2012, 2:50:09 PM

to redi...@googlegroups.com

Oh, and there are a few points I wanted to make specifically...

On Thu, Feb 23, 2012 at 4:28 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

> Hello Pedro,
>
> I agree with your analysis. Jeremy's proposal, while it could actually be
> useful to mitigate the problem, does not fix the root cause.
> I also agree about incremental resync as a solution to many of these issues.
>
> However I think the implementation of incremental resync should use
> the implementation proposed here:
>
> https://github.com/antirez/redis/issues/189
>
> In short it uses the trick of still accumulating the output buffer of
> the slaves for some time (or for some space) while the slave is not
> connected. Moreover there is a sliding window so that we don't discard
> the buffer sent to the slaves but take it for some time, since a slave
> may want to resync from an offset that is already flushed on the
> socket.

But in our case, when we're trying to use as much RAM on the box for redis as we can (across many instances), I wonder if the extra buffering would start to cause problems too.

> But back to the root cause for a moment, the problem is: "currently
> Redis does not handle well the case when multiple slaves want sync at
> once". I trust you about that, but I would like to understand why this
> happens.

I wouldn't say it that way. I'd say that redis assumes there is typically a single redis instance running on a given host. However, we're deploying them in a "1 instance per CPU core" environment. And our newer hosts are coming with 24 cores, which will just amplify the problem. (Thankfully they have SSDs so the disk i/o issue may be mitigated somewhat.)

> I mean, even without partial resync, Redis should handle that better.
> Full resync should just be slower, but not a DoS.
> Redis is already optimized to do a single BGSAVE on reconnection of
> multiple slaves, so what is actually DoSing it? Maybe the multiple
> bulk transfers generate too much I/O and we should throttle this stuff?
>
> Please if you have some information on this matter and how I can
> reproduce it I would love to insert this fix into 2.6 if possible.

Again, the real issue is not how redis handles re-sync. It does that well. But it doesn't give us enough control over what is currently an automatic behavior that ends up being harmful if you run enough instances on a large host.

Jeremy

Jeremy Zawodny

Mar 14, 2012, 4:17:45 PM

to redi...@googlegroups.com

And here's an implementation that seems to work in my testing:

https://github.com/jzawodn/redis/commit/be22df6f931dfd826d9d9541a6f4037392a34715

I'll submit a pull request and see what happens. :-)

Jeremy

hirose31

Mar 25, 2012, 11:47:23 PM

to Redis DB

I am looking forward to this being merged into the 2.4 branch!

On Mar 15, 5:17 am, Jeremy Zawodny <Jer...@Zawodny.com> wrote:
> And here's an implementation that seems to work in my testing:
>

> https://github.com/jzawodn/redis/commit/be22df6f931dfd826d9d9541a6f40...

Jeremy Zawodny

Jun 19, 2012, 12:43:51 PM

to redi...@googlegroups.com

Just to follow-up on this, I've ported my patch to 2.6-rc4 and we've been running that in production for a few days now.

I'd like to submit a pull request, but don't know if the maintainers are interested in merging it.

For an idea of how little it changes, here are the old 2.4 changes needed:

https://github.com/jzawodn/redis/commit/be22df6f931dfd826d9d9541a6f4037392a34715

Thoughts?

antirez? pietern?

Jeremy

Jeremy Zawodny

Dec 27, 2012, 2:18:48 PM

to redi...@googlegroups.com

Ok, 6 months later I've ported that feature to 2.6 and submitted a pull request against 2.6:

https://github.com/antirez/redis/pull/853

Any interest aside from us at craigslist?

Thanks,

Jeremy

Dvir Volk

Dec 27, 2012, 2:29:47 PM

to redi...@googlegroups.com

Nice!

To me this is one of the things I miss most in redis, and this is a nice step in that direction.

Being such a small optional patch, I'd really love to see this getting pulled.

BTW Isn't there some kind of solution to this planned as part of sentinel? and smooth replication planned as a part of 2.8?

Dvir Volk

Chief Architect, Everything.me

http://everything.me

Josiah Carlson

Dec 27, 2012, 4:56:49 PM

to redi...@googlegroups.com

Dvir,

I think you misread the patch. This just says whether or not a slave
would reconnect on connection failure.

- Josiah

Dvir Volk

Dec 27, 2012, 5:03:11 PM

to redi...@googlegroups.com

No, I read it (it wasn't that long!) and it's nice - not a silver bullet but as I said a simple first step.

Sentinel was going to limit the number of reconnections which is better, and 2.8 is supposed to introduce partial sync which will be the ultimate solution.

but as a workaround for the time being - why not?

Jeremy Zawodny

Dec 27, 2012, 5:07:39 PM

to redi...@googlegroups.com

Yeah, that's pretty much my thinking. We need (and use) this feature already. It keeps us from DoSing ourselves when the link between our datacenters fails and 8 redis instances on each box try to resync at the same time and the OOM killer gets busy on the master nodes--definitely not fun.

Jeremy

Dvir Volk

Dec 27, 2012, 5:10:36 PM

to redi...@googlegroups.com

what manages reconnections?

Jeremy Zawodny

Dec 27, 2012, 5:15:02 PM

to redi...@googlegroups.com

I have a script that runs via cron every few minutes and checks the replication state of all redis instances on localhost. If the host is expected to host masters, it simply exits. If it is expected to host slaves, it will:

1. connect to each instance
2. verify slaving is active
3. if slaving is not active, initiate replication via SLAVEOF $master_host $master_port
4. wait until the slave has fully synced before moving to the next instance

This allows us to fully re-sync all instances on a host (one at a time) without having to worry about DoSing ourselves. And, of course, we have monitoring in place to let us know if a slave has not been slaving for "too long".
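
The loop described above could be sketched as follows. This is not the actual craigslist script: the function names are hypothetical, and `send` is an assumed callable that delivers a command to the local instance on a given port and returns the reply as text. The INFO fields it reads (role, master_link_status, master_sync_in_progress) are the ones Redis reports for a slave.

```python
import time


def parse_info(text):
    """Turn the 'field:value' lines of an INFO reply into a dict."""
    info = {}
    for line in text.splitlines():
        if line and not line.startswith("#") and ":" in line:
            key, value = line.split(":", 1)
            info[key] = value
    return info


def is_replicating(info):
    """True when the instance is a slave with a live, fully synced link."""
    return (info.get("role") == "slave"
            and info.get("master_link_status") == "up"
            and info.get("master_sync_in_progress") == "0")


def resync_one_at_a_time(instances, send, sleep=time.sleep, poll_seconds=5):
    """Restart replication where needed, finishing each sync before the next.

    instances: list of (slave_port, master_host, master_port) tuples.
    Returns the ports on which replication had to be restarted.
    """
    restarted = []
    for port, master_host, master_port in instances:
        if is_replicating(parse_info(send(port, "INFO"))):
            continue  # this slave is healthy, leave it alone
        send(port, "SLAVEOF", master_host, master_port)
        restarted.append(port)
        # Block here so the master only ever serves one full sync from this host.
        while not is_replicating(parse_info(send(port, "INFO"))):
            sleep(poll_seconds)
    return restarted
```

A real version would also want a per-instance timeout so that one stuck sync cannot hold up the rest of the host indefinitely.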

Jeremy

Dvir Volk

Dec 27, 2012, 5:31:41 PM

to redi...@googlegroups.com

Thanks.

It's cool to see something that simple running one of the world's largest websites :)

you're not running on a cloud provider, am I right? If so, how often and why do you usually see these disconnects?

Jeremy Zawodny

Dec 27, 2012, 6:12:17 PM

to redi...@googlegroups.com

Correct, we're self-hosted on our own equipment.

We see the disconnects very rarely these days, but when we do see them they can be VERY painful without this patch.

I'd say half the time now it's due to planned maintenance and the other half is surprising.

If it'd be useful, I can probably post a slightly sanitized version of the script we use for this. I'd need to remove the dependency on our custom config module.

Jeremy
