--
You received this message because you are subscribed to the Google Groups "Redis DB" group.
To post to this group, send email to redi...@googlegroups.com.
To unsubscribe from this group, send email to redis-db+u...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/redis-db?hl=en.
Hello Andy, indeed in general what happens is the following:
COW gets worse and worse proportionally to:
1) Saving time
2) The number of read, and even worse write, operations performed
against the data while the child is saving
So if there is very high traffic, the only viable solutions are:
1) A saving slave
2) Using twice the RAM needed, so that 50% of the RAM is free to
perform the copy. There is still the penalty of copying every page,
but at least the system will not start swapping.
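To make the COW cost concrete, here is a rough back-of-the-envelope model of the extra memory a background save can consume. All figures (page size, write rate, save duration) are illustrative assumptions, not measurements from Redis:

```python
# Rough worst-case model of copy-on-write overhead during a background
# save: every page written by the parent while the child is saving gets
# duplicated once. Repeated writes to the same page are only copied the
# first time, so real usage is lower.

PAGE_SIZE = 4096  # bytes; typical x86 page size (assumption)

def cow_overhead_bytes(save_seconds, writes_per_second,
                       pages_touched_per_write=1):
    """Upper bound on extra memory duplicated by COW during a save."""
    return (save_seconds * writes_per_second
            * pages_touched_per_write * PAGE_SIZE)

# e.g. a 60 s save under 10k writes/s can duplicate up to ~2.3 GB:
print(cow_overhead_bytes(60, 10_000) / 2**30)
```

This is why both save time and write traffic appear in the proportionality above: the bound is their product.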
> If you're using a lot of ZSETs, there are two gotchas:
> 1- they take significantly longer to stream/write during a save than any
> other data type.
This is due to the fact that turning floats into strings is a slow
process. This improved a lot in 2.0 compared to 1.2.x, but it's still
not as fast as serializing other data types. If your sorted set scores
are actually "integers", that is, numbers without a decimal part, this
will be a bit faster.
This is definitely something we can try to improve in the future.
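The integer fast path can be sketched as follows. This is not Redis's actual serializer, just a minimal illustration of why scores without a decimal part are cheaper to write out:

```python
# Sketch (not Redis's actual code): formatting a float score is slower
# than formatting an integer, so scores with no decimal part can take
# a fast integer path during serialization.

def serialize_score(score: float) -> str:
    i = int(score)
    if i == score:          # no decimal part: cheap integer formatting
        return str(i)
    return repr(score)      # general float formatting, slower

print(serialize_score(3.0))   # -> "3"
print(serialize_score(2.5))   # -> "2.5"
```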
> 2- they're generally much larger from a memory footprint perspective than
> other data types. (for reference, we have ~1M ZSETs on an m1.large (8gigs
> total) and are forced to use VM.)
This is more or less unavoidable. In order to meet the big-O
guarantees that sorted sets promise, we need to keep two "views" of
the same data at the same time: a tree (a skip list in our specific
case) and a hash table.
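The "two views" idea can be shown with a toy model. Redis uses a skip list for the ordered view; a plain sorted list stands in for it here, and the class name is invented for illustration:

```python
# Minimal sketch of a sorted set's two views: a hash table for O(1)
# score lookup plus an ordered structure for range queries. Redis uses
# a skip list; bisect on a sorted list stands in for it here.
import bisect

class TinyZSet:
    def __init__(self):
        self.scores = {}      # member -> score (hash table view)
        self.ordered = []     # [(score, member)], kept sorted (tree view)

    def zadd(self, member, score):
        if member in self.scores:
            self.ordered.remove((self.scores[member], member))
        self.scores[member] = score
        bisect.insort(self.ordered, (score, member))

    def zscore(self, member):
        return self.scores.get(member)       # served by the hash view

    def zrange(self, start, stop):
        return [m for _, m in self.ordered[start:stop + 1]]  # ordered view

z = TinyZSet()
z.zadd("a", 2.0); z.zadd("b", 1.0); z.zadd("a", 0.5)
print(z.zscore("a"), z.zrange(0, 1))   # -> 0.5 ['a', 'b']
```

Every member is stored in both structures, which is why the memory footprint is roughly double that of a simpler type.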
> Like I said in IRC, best scenario we've found is use replication, turn off
> saving on the master, and only do saves from the slave. It's kinda
> frustrating.
We can try to make it better, but there is a fundamental theoretical
limit here: when you want to do point-in-time saving, as BGSAVE and
BGREWRITEAOF do, the only possibility is to remember how the dataset
used to be. We do this via COW semantics and fork(); another
possibility is to track changes and remember these changes while
saving. That is much more complex and anyway requires memory
proportional to the number of updates performed.
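The change-tracking alternative mentioned above can be sketched in a few lines. This is a hypothetical illustration, not anything Redis implements: the pre-save value of each key is remembered on first write, so the snapshot reader always sees the dataset as of the moment the save began:

```python
# Sketch of snapshotting by change tracking instead of fork()+COW:
# remember the old value of every key modified during the save.
# Memory grows with the number of distinct keys updated.

class SnapshottingDict:
    def __init__(self):
        self.data = {}
        self.saving = False
        self.undo = {}    # key -> value as it was when the save began

    def begin_save(self):
        self.saving, self.undo = True, {}

    def set(self, key, value):
        if self.saving and key not in self.undo:
            self.undo[key] = self.data.get(key)  # remember old value once
        self.data[key] = value

    def snapshot_get(self, key):
        """Value as of begin_save(), regardless of later writes."""
        return self.undo[key] if key in self.undo else self.data.get(key)

d = SnapshottingDict()
d.set("x", 1)
d.begin_save()
d.set("x", 2)
print(d.snapshot_get("x"), d.data["x"])   # -> 1 2
```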
So to have at the same time sorted sets updated at blazing speed and
point-in-time saving (be it .rdb saves or non-blocking log rewriting),
the only alternative is using 2x memory, in the form of a single box
with more memory or, as an alternative, a slave/master pair.
What we can try to do is lower the constants: a smaller percentage of
memory used, faster saving, and so on.
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
http://invece.org
"Once you have something that grows faster than education grows,
you’re always going to get a pop culture.", Alan Kay
> We can try to make it better, but there is some fundamental
> theoretical limit about this:
> When you want to do point-in-time saving, like BGSAVE and BGREWRITEAOF
> are doing, the only possibility is to remember how the dataset used to
> be. We are doing this via COW semantics and fork(); another possibility
> is to track changes and remember these changes while saving. It's much
> more complex but anyway requires memory proportional to the number
> of updates performed.
- Josiah
On Thu, Jul 29, 2010 at 2:58 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
Thanks, in my case there is no such thing as a low write window.
I am thinking that sharding to lower memory and updates is the way to go.
--
Aníbal Rojas
Ruby on Rails Web Developer
http://www.google.com/profiles/anibalrojas
Assuming a dual-write scenario, a single pass of the script could
handle synchronization, at which point you could decommission your
single Redis server.
- Josiah
2010/7/29 Aníbal Rojas <aniba...@gmail.com>:
Interesting. Gist? Github? And thanks.
--
Aníbal Rojas
Ruby on Rails Web Developer
http://www.google.com/profiles/anibalrojas
On Thu, Jul 29, 2010 at 7:09 PM, Josiah Carlson
Something like a last resort, to pull data out of a non-persistable Redis instance. I hope not to get there. In this case the DEBUG command proposed by Salvatore would come in handy.
On Jul 29, 2010 7:37 PM, "Josiah Carlson" <josiah....@gmail.com> wrote:
http://gist.github.com/499552
2010/7/29 Aníbal Rojas <aniba...@gmail.com>:
> Josiah,
>
> Interesting. Gist? Github? And thanks...
The problem with doing the SAVE operation incrementally, as the script
provided by Josiah does, is that the snapshot of the k/v-space is not
consistent. If there are writes to parts of the keyspace that were
already saved, there is no way to recover them. And if you think of
storing some kind of AOF for those parts, you still have the problem
of writes to parts of the keyspace that were not covered yet. This,
together with composite operations (SUNIONSTORE, SORT STORE, etc.),
makes for a very hard problem to solve within the same process, where
again the fork() approach gives us the results we want.
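The inconsistency Pieter describes is easy to demonstrate with a toy key-by-key scan. The data and the timing of the "concurrent" writes are invented for illustration:

```python
# A key-by-key scan is not a point-in-time snapshot: a write landing
# mid-scan leaves the snapshot mixing two different dataset states.

data = {"a": 1, "b": 2}
snapshot = {}
keys = sorted(data)

snapshot[keys[0]] = data[keys[0]]   # save "a"
data["a"], data["b"] = 100, 99      # concurrent writes arrive mid-scan
snapshot[keys[1]] = data[keys[1]]   # save "b"

# Neither the old state {a:1, b:2} nor the new state {a:100, b:99}:
print(snapshot)                     # -> {'a': 1, 'b': 99}
```

The write to "a" is lost from the snapshot's perspective even though the write to "b" made it in, which is exactly why fork() (which freezes the whole keyspace at one instant) is used instead.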
Cheers,
Pieter
2010/7/30 Aníbal Rojas <aniba...@gmail.com>:
On Jul 30, 2010, at 12:48 AM, Pieter Noordhuis <pcnoo...@gmail.com> wrote:
> Andy: this indeed sounds like a very obvious solution but it comes
> with numerous drawbacks. For instance, when this secondary hash table
> is active and there is a write against a large ZSET, we need to
> create a copy in the new hash table. Unlike disk-based k/v stores,
> Redis stores everything in memory, which causes a sorted set to be
> stored in many fragments. The only way to copy it is to literally
> loop over all the elements, allocating some memory for every new
> element and storing it. If it were as easy as just copying a few
> megabytes of memory, this would be a much easier problem to solve.
> However, this looping takes much time and renders (the initial)
> writes to Redis after SAVE was issued *very* slow, contrary to the
> current fork() approach.
>
>
Ya, I figured this would be an issue. How does the VM implementation deal with this? It seems like a similar problem, where each element of the zset would have to be restored individually.
It'd be interesting to see how much time it would really take to copy a large zset value -- say 1000 or 10000 elements. The performance impact may still be far better than fork+COW causing the entire instance to be unresponsive for minutes. A single key being unresponsive for a second seems like the lesser of two evils here.
Currently, the only environment where fork() is causing issues is EC2.
There are no reports of this issue in a non-virtualized environment.
The problem with the key-level copy-on-write solution is that it is
hard to predict what is going to happen performance-wise after a
(BG)SAVE. Apart from this specific fork() issue on EC2, we know that
issuing a BGSAVE will hardly affect throughput (there is only the
initial overhead of calling fork()), which is important to a lot of
users. Other than that, I very much appreciate this discussion and
it's important to keep thinking about ways to improve this.
Cheers,
Pieter
My script was more of an "oy! emergency!" solution (which I've used in
the past, incidentally). There are tricks that can be done for
dual-writing scenarios (for switching from a single instance to a
sharded instance as Andy had talked about), but there are race
conditions with all of the sequence objects in that case.
Overall, it's a tough problem. About the only low-memory way of doing
it is AOF, but then there is the AOF rewriting, which necessarily
takes more memory (though there are some multi-pass tricks that
perform well with a modest amount of memory, but do take multiple
passes over the data).
- Josiah
Hello Andy,
what you described falls into the (only possible) category of "copy on
write" that I described in this thread, but with different (better)
constant factors.
COW is doing exactly this currently, that is, copying values when they
are modified. But this is currently done on a per-page basis; with
your approach it would be done on a per-key basis.
Both have problems: for instance, it's much better to copy just a page
after an LSET than a 10000-element list, so sometimes one wins,
sometimes the other, but the code complexity is completely different,
as the current solution is basically free from the point of view of
complexity.
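The per-page versus per-key trade-off in the LSET example comes down to how many bytes each scheme duplicates. The sizes below (4 KB pages, 64 bytes per list node) are illustrative assumptions:

```python
# Back-of-the-envelope comparison of the two COW granularities for a
# single LSET on a 10,000-element list. Per-page COW duplicates the one
# page the write touches; per-key COW duplicates the whole list.

PAGE_SIZE = 4096               # bytes per page (assumption)
BYTES_PER_ELEMENT = 64         # assumed per-node overhead of a list

def per_page_cost(pages_touched):
    return pages_touched * PAGE_SIZE

def per_key_cost(list_length):
    return list_length * BYTES_PER_ELEMENT

print(per_page_cost(1))        # -> 4096
print(per_key_cost(10_000))    # -> 640000
```

For this workload page-level COW wins by two orders of magnitude; for a workload rewriting entire small values, per-key copying would win instead, which is the "sometimes one wins, sometimes the other" point above.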
What I think is that in the short term there is not much to do about
that, *but* we can at least remove the COW cost for read operations,
that is, add a flag in the configuration file that causes a value to
always be copied by addReply() instead of using the refcount. This
will be much better from the point of view of COW for read operations.
In the long term we can instead investigate this: incrementally
compacting the AOF file from another, unrelated process, so there will
never be a need to call BGREWRITEAOF.
This is *very* hard when you have aggregate types in a KV store,
especially since the AOF is composed of all the possible commands
Redis is able to handle, but maybe there is some solution. Probably
this is not a real-world possibility, but it's worth investigating.
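The compaction idea, for plain string keys at least, amounts to replaying the log and keeping only what rebuilds the final state. A toy sketch handling only SET/DEL (the aggregate-type commands are exactly the hard part the paragraph above mentions):

```python
# Sketch of AOF compaction by replay: feed the command log through an
# in-memory state and emit the minimal command list that reproduces the
# final state. Only SET and DEL are handled; real Redis AOF contains
# every write command for every type, which is what makes this hard.

def compact_aof(commands):
    state = {}
    for cmd, key, *rest in commands:
        if cmd == "SET":
            state[key] = rest[0]
        elif cmd == "DEL":
            state.pop(key, None)
    return [("SET", k, v) for k, v in sorted(state.items())]

log = [("SET", "a", "1"), ("SET", "a", "2"),
       ("SET", "b", "9"), ("DEL", "b")]
print(compact_aof(log))   # -> [('SET', 'a', '2')]
```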
Btw, every solution is *mathematically* bound to copy the data that is
modified while the server is running. There is no escaping this when
the need is point-in-time dumps. So the real-world solutions are IMHO:
1) Redis-cluster... we can just run many instances that save
independently. There will no longer be a unique big instance.
2) Now that Redis 2.2 is more and more memory efficient, just account
for 2x the memory you need if you want to save without delays while
there is very high traffic against your instance.
3) Find allocation tricks so that we can try to enhance the locality
of data, so that with high probability a list or a sorted set uses the
same "memory pool" and sits in more or less adjacent pages.
For sure there is more to investigate in this regard, but it's better
to do so *after* clustering is available. Maybe it will just fix this
issue automagically.
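The many-small-instances route in point 1 needs a deterministic key-to-instance mapping. A minimal sketch with a CRC-based hash; the host names are hypothetical, and plain modulo stands in for the hash-slot scheme a real cluster would use:

```python
# Sketch of client-side sharding: map each key deterministically to one
# of several small Redis instances. Host names are made up; Redis
# Cluster uses CRC16 hash slots, plain CRC32-modulo stands in here.
import zlib

SHARDS = ["redis-0:6379", "redis-1:6379", "redis-2:6379"]

def shard_for(key: str) -> str:
    return SHARDS[zlib.crc32(key.encode()) % len(SHARDS)]

# The mapping is stable: the same key always lands on the same shard,
# and each shard saves (and pays its COW cost) independently.
assert shard_for("user:1000") == shard_for("user:1000")
```

Each instance then holds a fraction of the data, so its fork()-time COW overhead shrinks proportionally.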
A persistence-only process, yes.
> This is *very* hard when you have aggregate types in a KV store,
> especially since the AOF is composed of all the possible commands
> Redis is able to handle, but maybe there is some solution. Probably
> this is not a real world possibility but it's worth investigating.
I realize it is complex...
> Btw every solution is *mathematically* bound to copy the data that is
> modified while the server is running. It's not something that you can
> escape when the need is point-in-time dumps. So the real world
> solutions are IMHO:
>
> 1) Redis-cluster... we can just run many instances that will save
> independently. There will be no longer a unique big instance.
That's it: throwing in more servers with lower profiles is easier than
spinning up a gigantic RAM instance. From an operations and financial
point of view, it is a better approach.
> 2) Now that Redis 2.2 is more and more memory efficient, just account
> for 2x the memory you need if you want to save without delays
> while there is very high traffic against your instance.
Ouch, expensive. RAM is still the most expensive cloud resource. I
will try replacing my big Redis instance with a bunch of sharded
32-bit instances. More servers == more administration, but it is the
price to pay for raw speed.
> 3) Find allocation tricks so that we can try to enhance the locality
> of data, so that with high probability a list or a sorted set are
> using the same "memory pool" and are more or less in very near pages.
That will be complex, as sets grow, shrink, and are constantly updated...
> For sure there is to investigate more in this regard, but it's better
> to do so *after* clustering is available. Maybe it will just fix this
> issue automagically.
+1.