Furking (forking) issues / tags: fork ec2 parent child process cpu memory ram rc3

Aníbal Rojas

Jul 29, 2010, 4:56:51 PM
to Redis DB

Crime Scene:

Redis RC3, running on an m2.4xlarge EC2 instance (68.4 GB of memory, 8
virtual cores) with ~70MM keys.

Victim:

The Redis master process: each time it forks, it pegs the CPU at 100%
for _minutes_ and stops responding to clients until the requested
command (BGSAVE, BGREWRITEAOF, etc.) has finished.

Suspects:

* EC2 virtualization
* Number of updates
* Heavy use of ZSETS (fragmented memory?)

--
Aníbal

Andy McCurdy

Jul 29, 2010, 5:48:43 PM
to redi...@googlegroups.com
We've seen the same behavior on lower-volume datasets too. The number of updates is definitely a factor here: the more updates, the higher the likelihood that they'll trigger copy-on-write for many pages, eventually swapping the process.

If you're using a lot of ZSETs, there are two gotchas:
1- they take significantly longer to stream/write during a save than any other data type.
2- they're generally much larger from a memory footprint perspective than other data types. (for reference, we have ~1M ZSETs on an m1.large (8gigs total) and are forced to use VM.)

Like I said in IRC, the best setup we've found is to use replication, turn off saving on the master, and only do saves from the slave. It's kinda frustrating.
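
For illustration only, here is a minimal sketch of that setup using redis-py (hypothetical hostname; it assumes snapshotting is already disabled in the master's redis.conf and that the slave is configured as a replica of the master):

    # Minimal sketch: trigger periodic dumps on the slave only (illustrative).
    # Assumes the master's redis.conf contains no "save" lines and the slave
    # has "slaveof <master-ip> 6379" configured, so the master never forks.
    import time
    import redis

    slave = redis.Redis(host='redis-slave.example.com', port=6379)  # hypothetical host

    while True:
        slave.bgsave()     # background dump runs on the slave
        time.sleep(900)    # e.g. every 15 minutes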

-andy

2010/7/29 Aníbal Rojas <aniba...@gmail.com>



Salvatore Sanfilippo

Jul 29, 2010, 5:58:56 PM
to redi...@googlegroups.com
On Thu, Jul 29, 2010 at 11:48 PM, Andy McCurdy <sed...@gmail.com> wrote:
> We've seen the same behavior on lower-volume datasets too. The number of
> updates is definitely a factor here: the more updates, the higher the
> likelihood that they'll trigger copy-on-write for many pages, eventually
> swapping the process.

Hello Andy, indeed in general what happens is the following:

COW gets worse and worse in proportion to:

1) Saving time
2) The number of read operations, and even more so write operations,
performed against the data while the child is saving

So if there is very high traffic, the only viable solutions are:

1) A saving slave
2) Using twice the RAM needed, so that 50% of the RAM is free to
perform the copy. You still pay the penalty of copying every page, but
at least the system will not start swapping.
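
To make that proportionality concrete, here is a back-of-the-envelope worst case (the numbers below are purely illustrative assumptions, not measurements), assuming 4 KB pages and that every write during the save dirties a distinct page:

    # Rough worst-case COW overhead estimate (illustrative assumptions only).
    page_size = 4 * 1024        # 4 KB pages
    writes_per_sec = 20000      # sustained write commands during the save
    save_seconds = 120          # how long the child takes to dump the dataset

    dirtied_pages = writes_per_sec * save_seconds   # worst case: each write hits a new page
    extra_bytes = dirtied_pages * page_size
    print(extra_bytes / 2.0**30)                    # ~9.2 GB of extra RAM

In practice many writes land on pages that are already dirty, so the real overhead is lower, but the dependence on save time and write rate has this shape.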

> If you're using a lot of ZSETs, there are two gotchas:
> 1- they take significantly longer to stream/write during a save than any
> other data type.

This is due to the fact that turning floats into strings is a slow
process. This improved a lot in 2.0 compared to 1.2.x, but it's still
not as fast as serializing other data types. If your sorted set scores
are actually "integers", that is, numbers without a decimal part, this
will be a bit faster.

This is definitely something we can try to improve in the future.

> 2- they're generally much larger from a memory footprint perspective than
> other data types. (for reference, we have ~1M ZSETs on an m1.large (8gigs
> total) and are forced to use VM.)

This is more or less unavoidable. In order to meet the big-O
guarantees that sorted sets promise, we have to keep two "views" of the
same data at the same time: a tree (a skip list in our specific case)
and a hash table.

> Like I said in IRC, the best setup we've found is to use replication, turn
> off saving on the master, and only do saves from the slave. It's kinda
> frustrating.

We can try to make it better, but there is a fundamental theoretical
limit here:

When you want point-in-time saving, as BGSAVE and BGREWRITEAOF do, the
only possibility is to remember how the dataset used to be. We do this
via COW semantics and fork(); another possibility is to track changes
and remember them while saving. That is much more complex, and it still
requires memory proportional to the number of updates performed.

So to have sorted sets updated at blazing speed and point-in-time
saving (be it .rdb saves or non-blocking log rewriting) at the same
time, the only alternative is using 2x memory, either as a single box
with more memory or as a master/slave pair.

What we can try to do is lower the constants: a smaller percentage of
memory used, faster saving, and so on.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"Once you have something that grows faster than education grows,
you’re always going to get a pop culture.", Alan Kay

Andy McCurdy

Jul 29, 2010, 7:01:09 PM
to redi...@googlegroups.com
On Thu, Jul 29, 2010 at 2:58 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> We can try to make it better, but there is a fundamental theoretical
> limit here:
>
> When you want point-in-time saving, as BGSAVE and BGREWRITEAOF do, the
> only possibility is to remember how the dataset used to be. We do this
> via COW semantics and fork(); another possibility is to track changes
> and remember them while saving. That is much more complex, and it still
> requires memory proportional to the number of updates performed.


So this actually doesn't seem terribly difficult. Admittedly, I haven't implemented it, and I'm probably not thinking about critical details :) Nevertheless, I'll make a proposal.

Assumption: Redis makes a restriction that only a single (BG)SAVE or BGREWRITEAOF operation can be run at any given time. A subsequent SAVE call would return an error indicating that a save is already taking place.

Implementation:

- When any of these commands are run, create a new temp global hash table with 0 objects in it.

- When the temp global hash table != NULL:
    - Read operations first check the temp global hash table. If the key being read isn't there, fall back to looking for it in the normal hash table. (See note about deletes below)
    - Write operations check to see if the key being written to is already in the temp hash table. If it isn't, copy the entire value of the key from the normal hash table to the temp hash table. Finally, perform the write operation, updating the value in the temp hash table.
    - Delete operations on a key add a new entry into the temp hash table with a value of NULL. This special value indicates the value no longer exists.

- When the IO of the save operation is finished, start iterating through the temp hash table and move the values back into the original hash table.
    - While moving key/value pairs, if a value is NULL, delete it from the normal hash table rather than copying it.
    - For performance reasons, you'd probably want to move a handful of keys each tick rather than all in one loop.
    - While the values are being moved, it's fine if new write operations are happening -- the same process continues -- the key gets moved into the temp hash table and written there. It will get moved back into the normal hash table at some point.

- Finally, when the length of the temp hash table reaches 0, destroy the temp hash table and return from the SAVE. Operations now work as usual, updating the normal hash table directly.
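
A rough sketch of the above in Python (my own illustration of the proposal, with made-up names; the real thing would obviously live in Redis's C dict code):

    # Sketch of the proposed key-level copy-on-write (illustrative only).
    import copy

    DELETED = object()          # sentinel: key was deleted while saving

    class KeyLevelCOW:
        def __init__(self):
            self.main = {}      # the normal keyspace
            self.temp = None    # non-None only while a save is in progress

        def start_save(self):
            if self.temp is not None:
                raise RuntimeError("a save is already in progress")
            self.temp = {}      # the saver then iterates self.main undisturbed

        def read(self, key):
            if self.temp is not None and key in self.temp:
                value = self.temp[key]
                return None if value is DELETED else value
            return self.main.get(key)

        def write(self, key, mutate):
            if self.temp is None:
                target = self.main
            else:
                if key not in self.temp:
                    # First write during the save: copy the whole value over.
                    self.temp[key] = copy.deepcopy(self.main.get(key))
                target = self.temp
            current = target.get(key)
            target[key] = mutate(None if current is DELETED else current)

        def delete(self, key):
            if self.temp is None:
                self.main.pop(key, None)
            else:
                self.temp[key] = DELETED

        def merge_some(self, n=10):
            # Called each server tick once the save I/O has finished: move a
            # handful of entries back into the main table, drop temp when empty.
            if self.temp is None:
                return
            for key in list(self.temp)[:n]:
                value = self.temp.pop(key)
                if value is DELETED:
                    self.main.pop(key, None)
                else:
                    self.main[key] = value
            if not self.temp:
                self.temp = None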

Again, I might be forgetting about some critical piece of why this wouldn't work. If this would work, though, I think it'd go a long way toward making saving more tolerable.

-andy 

Josiah Carlson

Jul 29, 2010, 7:06:58 PM
to redi...@googlegroups.com
Also, there's the other small problem that when slaves
connect/reconnect, the master makes a fresh dump before streaming it to
the slave. I've run into this issue a few times myself... you have
to get your slaves connected before memory use creeps up, and/or during
a period of lower write volume.

- Josiah

On Thu, Jul 29, 2010 at 2:58 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

Aníbal Rojas

Jul 29, 2010, 7:19:39 PM
to Redis DB

Salvatore, Andy,

I understand I can't have it all (I am a married man). I
understand that RAM isn't magic (though it feels like it) and that there
are some limitations we have to respect (again, I am married).

For what I am using it for, I want Redis for its speed, period. I
would be more than happy to delegate persistence to another instance
(but it is expensive), and that's why I proposed a "persistence-only
slave instance with low memory requirements" a while ago:

http://groups.google.com/group/redis-db/browse_thread/thread/f3098b927ce86aef/b28f013e523a2fc8?#b28f013e523a2fc8

BUT, at least to start the slave, the master will have to fork, and
block again... :(

The only workaround I see (from an operational point of view) is
sharding the data to avoid having one big Redis instance with lots of
RAM and updates.

Thanks a lot for your help,

--
Aníbal

Aníbal Rojas

Jul 29, 2010, 7:30:39 PM
to redi...@googlegroups.com
Josiah,

Thanks. In my case there is no such thing as a low-write window.

I am thinking that sharding to lower memory use and update volume is the way to go.

--
Aníbal Rojas
Ruby on Rails Web Developer
http://www.google.com/profiles/anibalrojas

Josiah Carlson

Jul 29, 2010, 7:39:37 PM
to redi...@googlegroups.com
I've got a script that I use to mirror a Redis server when it won't
otherwise let me save/slave. It could be easily modified to handle a
situation where you are dual-writing to your main redis server and the
shards.

Assuming a dual-write scenario, a single pass of the script could
handle synchronization, at which point you could decommission your
single Redis server.
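
For a rough idea of what a one-pass mirroring script can look like (this is NOT Josiah's actual script; hostnames and details are made up, and it assumes a reasonably recent redis-py):

    # Sketch of a one-pass, type-by-type copy between two Redis servers.
    # Illustrative only: KEYS blocks the server, and anything written to the
    # source while the pass runs can be missed.
    import redis

    src = redis.Redis(host='old-master.example.com')   # hypothetical hosts
    dst = redis.Redis(host='new-shard.example.com')

    for key in src.keys('*'):
        ktype = src.type(key)
        if ktype == b'string':
            dst.set(key, src.get(key))
        elif ktype == b'list':
            items = src.lrange(key, 0, -1)
            if items:
                dst.rpush(key, *items)
        elif ktype == b'set':
            members = src.smembers(key)
            if members:
                dst.sadd(key, *members)
        elif ktype == b'zset':
            for member, score in src.zrange(key, 0, -1, withscores=True):
                dst.zadd(key, {member: score})
        elif ktype == b'hash':
            dst.hmset(key, src.hgetall(key))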

- Josiah

2010/7/29 Aníbal Rojas <aniba...@gmail.com>:

Aníbal Rojas

Jul 29, 2010, 7:45:26 PM
to redi...@googlegroups.com
Josiah,

Interesting. Gist? Github? And thanks.

--
Aníbal Rojas
Ruby on Rails Web Developer
http://www.google.com/profiles/anibalrojas

On Thu, Jul 29, 2010 at 7:09 PM, Josiah Carlson

Josiah Carlson

Jul 29, 2010, 8:07:20 PM
to redi...@googlegroups.com

Aníbal Rojas

Jul 29, 2010, 9:24:38 PM
to redi...@googlegroups.com


Something like a last resort, to pull data out of a Redis instance that can't persist. I hope not to get there. In that case the DEBUG command proposed by Salvatore would come in handy.

On Jul 29, 2010 7:37 PM, "Josiah Carlson" <josiah....@gmail.com> wrote:

http://gist.github.com/499552


2010/7/29 Aníbal Rojas <aniba...@gmail.com>:
> Josiah,
>

> Interesting. Gist? Github? And thanks...

Pieter Noordhuis

Jul 30, 2010, 3:48:07 AM
to redi...@googlegroups.com
Andy: this indeed sounds like a very obvious solution, but it comes
with numerous drawbacks. For instance, when this secondary hash table
is active and there is a write against a large ZSET, we need to copy
the value into the new hash table. Unlike disk-based k/v stores, Redis
stores everything in memory, which causes a sorted set to be stored in
many fragments. The only way to copy it is to literally loop over all
the elements, allocate some memory for every new element and store it.
If it were as easy as copying a few contiguous megabytes of memory,
this would be a much easier problem to solve. However, this looping
takes a lot of time and renders (the initial) writes to Redis after a
SAVE is issued *very* slow, in contrast to the current fork() approach.

The problem with doing the SAVE operation incrementally, as the script
provided by Josiah does, is that the snapshot of the k/v-space is not
consistent. If there are writes to parts of the keyspace that were
already saved, there is no way to recover those. And if you think of
storing some kind of AOF for those parts, you still have the problem
of writes to parts of the keyspace that were not covered yet. This,
together with composite operations (SUNIONSTORE, SORT ... STORE, etc.),
makes for a very hard problem to solve within the same process, where
again the fork() approach gives us the results we want.
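
A toy example of the kind of inconsistency described above (illustrative only; this is not how redis-server or the script actually work):

    # Incremental save: keys are copied one at a time while writes keep arriving.
    data = {"a": 100, "b": 0}
    dump = {}

    dump["a"] = data["a"]      # "a" is saved first (100)

    # A concurrent write lands between the two steps: move 30 from a to b.
    data["a"] -= 30
    data["b"] += 30

    dump["b"] = data["b"]      # "b" is saved afterwards (30)

    print(sum(data.values()), sum(dump.values()))   # 100 vs 130: the dump is not
                                                    # a snapshot of any real state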

Cheers,
Pieter

2010/7/30 Aníbal Rojas <aniba...@gmail.com>:

Andy McCurdy

Jul 30, 2010, 11:45:17 AM
to redi...@googlegroups.com

On Jul 30, 2010, at 12:48 AM, Pieter Noordhuis <pcnoo...@gmail.com> wrote:

> Andy: this indeed sounds like a very obvious solution, but it comes
> with numerous drawbacks. For instance, when this secondary hash table
> is active and there is a write against a large ZSET, we need to copy
> the value into the new hash table. Unlike disk-based k/v stores, Redis
> stores everything in memory, which causes a sorted set to be stored in
> many fragments. The only way to copy it is to literally loop over all
> the elements, allocate some memory for every new element and store it.
> If it were as easy as copying a few contiguous megabytes of memory,
> this would be a much easier problem to solve. However, this looping
> takes a lot of time and renders (the initial) writes to Redis after a
> SAVE is issued *very* slow, in contrast to the current fork() approach.
>

Ya, I figured this would be an issue. How does the VM implementation deal with this? Seems like it'd be a similar problem, where each element of the zset would have to be restored individually.

It'd be interesting to see how much time it would really take to copy a large zset value -- say 1000 or 10000 elements. The performance impact may still be far better than fork+COW causing the entire instance to be unresponsive for minutes. Seems like a single key being unresponsive for a second is the lesser of two evils here.
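
For a very rough sense of scale, a toy measurement (illustrative only; copying Redis's real skiplist-plus-hash representation in C behaves differently, but it shows the element-by-element nature of the copy):

    # Toy measurement: element-by-element copy of a 10,000-entry member->score map.
    import time

    zset = {"member:%d" % i: float(i) for i in range(10000)}

    start = time.perf_counter()
    snapshot = {}
    for member, score in zset.items():       # one insert per element
        snapshot[member] = score
    elapsed = time.perf_counter() - start
    print("copied %d entries in %.3f ms" % (len(snapshot), elapsed * 1000.0))
    # Redis would additionally allocate a new string and a skiplist node per
    # element, so a real per-key copy would cost noticeably more than this.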

Pieter Noordhuis

Jul 30, 2010, 12:06:46 PM
to redi...@googlegroups.com
With the threaded VM, swapped-out values are loaded asynchronously.
The client simply blocks until the value (or more than one value) is
swapped in before the command gets executed, without affecting other
clients.

Currently, the only environment where fork() is causing issues is EC2.
There are no reports of this issue in a non-virtualized environment.
The problem with the key-level copy-on-write solution is that it is
hard to predict what is going to happen performance-wise after a
(BG)SAVE. Apart from this specific fork() issue on EC2, we know that
issuing a BGSAVE will hardly affect throughput (there is only the
initial overhead of calling fork()), which is important to a lot of
users. Other than that, I very much appreciate this discussion and
it's important to keep thinking about ways to improve this.

Cheers,
Pieter

Josiah Carlson

Jul 30, 2010, 12:49:20 PM
to redi...@googlegroups.com
On Fri, Jul 30, 2010 at 12:48 AM, Pieter Noordhuis
<pcnoo...@gmail.com> wrote:
> Andy: this indeed sounds like a very obvious solution, but it comes
> with numerous drawbacks. For instance, when this secondary hash table
> is active and there is a write against a large ZSET, we need to copy
> the value into the new hash table. Unlike disk-based k/v stores, Redis
> stores everything in memory, which causes a sorted set to be stored in
> many fragments. The only way to copy it is to literally loop over all
> the elements, allocate some memory for every new element and store it.
> If it were as easy as copying a few contiguous megabytes of memory,
> this would be a much easier problem to solve. However, this looping
> takes a lot of time and renders (the initial) writes to Redis after a
> SAVE is issued *very* slow, in contrast to the current fork() approach.
>
> The problem with doing the SAVE operation incrementally, as the script
> provided by Josiah does, is that the snapshot of the k/v-space is not
> consistent. If there are writes to parts of the keyspace that were
> already saved, there is no way to recover those. And if you think of
> storing some kind of AOF for those parts, you still have the problem
> of writes to parts of the keyspace that were not covered yet. This,
> together with composite operations (SUNIONSTORE, SORT ... STORE, etc.),
> makes for a very hard problem to solve within the same process, where
> again the fork() approach gives us the results we want.

My script was more of an "oy! emergency!" solution (which I've used in
the past, incidentally). There are tricks that can be done for
dual-writing scenarios (for switching from a single instance to a
sharded instance as Andy had talked about), but there are race
conditions with all of the sequence objects in that case.

Overall, it's a tough problem. About the only low-memory way of doing
it is AOF, but then there is the AOF rewriting, which necessarily
takes more memory (though there are some multi-pass tricks that
perform well with a modest amount of memory, but do take multiple
passes over the data).

- Josiah

Andy McCurdy

Jul 30, 2010, 1:10:58 PM
to redi...@googlegroups.com
Actually, I'm able to simulate the swapping-to-hell (as I've dubbed it) on a local, non-virtualized box here with my dataset. It's not just an EC2 issue.

Salvatore Sanfilippo

Jul 30, 2010, 2:21:34 PM
to redi...@googlegroups.com
On Fri, Jul 30, 2010 at 1:01 AM, Andy McCurdy <sed...@gmail.com> wrote:
>
>
> On Thu, Jul 29, 2010 at 2:58 PM, Salvatore Sanfilippo <ant...@gmail.com>
> wrote:
>>
>> We can try to make it better, but there is a fundamental
>> theoretical limit here:

Hello Andy,

what you described falls into the (only possible) category of "copy on
write" that I described in this thread, but with different (better)
constant factors.

COW does exactly this already, that is, it copies values when they are
modified. But this is currently done on a per-page basis; with your
approach it would be done on a per-key basis.

Both have problems: for instance, it's much better to copy just one
page after an LSET than a 10000-element list, so sometimes one wins and
sometimes the other, but the code complexity is completely different,
as the current solution is basically free from the point of view of
complexity.
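
To put rough numbers on that trade-off (my own illustrative figures, assuming 4 KB pages and ~64 bytes per element):

    # Illustrative comparison of the two COW granularities.
    page_size = 4 * 1024

    # Case 1: LSET on a 10,000-element list touches one node.
    page_cow = page_size                  # kernel copies the one dirtied page
    key_cow = 10000 * 64                  # key-level COW duplicates the whole list
    print(key_cow / float(page_cow))      # ~156x more copying for key-level COW

    # Case 2: overwriting a single small (~64 byte) string value.
    print(page_size / 64.0)               # ~64x more copying for page-level COW

So per-page COW wins for small edits to big values, and per-key COW wins for small values, which is exactly the "sometimes one wins, sometimes the other" above.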

What I think is that in the short term there is not much we can do
about that, *but* we can at least remove the COW caused by reads: that
is, add a flag in the configuration file that makes addReply() always
copy a value instead of using the refcount. This will be much better
from the point of view of COW for read operations.

Long term instead we can investigate this: to incrementally compact
the AOF file from another unrelated process. So there will never be
the need to call BGREWRITEAOF.

This is *very* hard when you have aggregate types in a KV store,
especially since the AOF is composed of all the possible commands
Redis is able to handle, but maybe there is some solution. Probably
this is not a real world possibility but it's worth investigating.

Btw every solution is *mathematically* bound to copy the data that is
modified while the server is running. It's not something that you can
escape when the need is point-in-time dumps. So the real world
solutions are IMHO:

1) Redis-cluster... we can just run many instances that will save
independently. There will be no longer a unique big instance.
2) Now that Redis 2.2 is more and more expensive memory-wise, just
account 2x the memory you need if you want to save without delays
while there is a very high traffic against your instance.
3) Find allocation tricks so that we can try to enhance the locality
of data, so that with high probability a list or a sorted set are
using the same "memory pool" and are more or less in very near pages.

For sure there is more to investigate in this regard, but it's better
to do so *after* clustering is available. Maybe it will just fix this
issue automagically.

Aníbal Rojas

Jul 30, 2010, 9:36:43 PM
to redi...@googlegroups.com
>
> Long term instead we can investigate this: to incrementally compact
> the AOF file from another unrelated process. So there will never be
> the need to call BGREWRITEAOF.

A persistence only process, yes.

> This is *very* hard when you have aggregate types in a KV store,
> especially since the AOF is composed of all the possible commands
> Redis is able to handle, but maybe there is some solution. Probably
> this is not a real world possibility but it's worth investigating.

I realize it is complex...

> Btw every solution is *mathematically* bound to copy the data that is
> modified while the server is running. It's not something that you can
> escape when the need is point-in-time dumps. So the real world
> solutions are IMHO:
>
> 1) Redis-cluster... we can just run many instances that will save
> independently. There will be no longer a unique big instance.

That's it: throwing more servers with lower profiles at the problem is
easier than spinning up a gigantic-RAM instance. From an operations and
financial point of view, it is a better approach.

> 2) Now that Redis 2.2 is more and more expensive memory-wise, just
> account 2x the memory you need if you want to save without delays
> while there is a very high traffic against your instance.

Ouch, expensive. RAM is still the most expensive cloud resource. I
will try replacing my big Redis instance with a bunch of sharded 32-bit
instances. More servers == more administration, but it is the price to
pay for raw speed.

> 3) Find allocation tricks so that we can try to enhance the locality
> of data, so that with high probability a list or a sorted set are
> using the same "memory pool" and are more or less in very near pages.

That will be complex, as sets grow, shrink and are constantly updated...

> For sure there is more to investigate in this regard, but it's better
> to do so *after* clustering is available. Maybe it will just fix this
> issue automagically.

+1.

Bjarni Rúnar Einarsson

Jul 31, 2010, 6:39:40 AM
to redi...@googlegroups.com
Excuse me if this has already been discussed, but I wonder if it might help somewhat if the BGSAVE worker released any memory it had finished dumping to disk? After having a very quick look at the code, it doesn't appear to do so today.

This might not work well in practice due to fragmentation and the fact that access patterns and memory pages don't align well, but at least theoretically, if the BGSAVE is done writing everything from a given memory page to disk, then that page could be released and COW wouldn't need to happen if that data ends up being modified later on.  Or, if COW has already happened to that page, at least we have reduced the overhead somewhat.

So... will this fail miserably? :-)

It should be relatively easy to implement and test, just add some (optional) cleanup code to the end of rdbSaveObject()?


2010/7/31 Aníbal Rojas <aniba...@gmail.com>




--
Bjarni R. Einarsson

http://beanstalks-project.net/
http://bre.klaki.net/