we have been running redis with vm enabled in production for two
months now, and we would like to share our experience so far.
quick summary: redis with vm sucks.
first, some context. we use redis as the main datastore for a
quickly growing dataset. the total size is about 8 GB (on disk)
now, and projected to grow to 500 GB or more eventually. it
consists of redis hashes, which vary in size between 5 and 300
KB. at any given time, only a small percentage of hashes are in
use.
in theory, this looks like an ideal use case for redis vm, as the
"hot" dataset is small (only a few GB) and the individual values
are rather large.
in practice, we ran into the following problems:
a) stability
as reported in issue #395, redis 2.0.4 is very unstable with
both vm and replication enabled, crashing twice on us after less
than an hour once we put load on it. we went back to 2.0.1, and
we considered it rock solid until yesterday, when it unexpectedly
crashed as well.
http://code.google.com/p/redis/issues/detail?id=395
b) memory consumption
as reported in issue #394 (by brett), redis with vm "greatly
exceeds vm-max-memory ... [and] continues to grow." our boxes
have 24 GB of ram (and no swap). we have configured vm as
follows:
vm-enabled yes
vm-max-memory 6gb
vm-page-size 8kb
vm-pages 50304000
vm-max-threads 4
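for reference, here is what these settings work out to (a quick
python sketch, using only the numbers above):

page_size = 8 * 1024          # vm-page-size 8kb
pages = 50304000              # vm-pages

swap_bytes = page_size * pages
print("swap file size: %.1f GB" % (swap_bytes / 1024.0 ** 3))      # ~384 GB

for value_kb in (5, 300):     # our hashes are roughly 5-300 KB serialized
    pages_needed = -(-value_kb * 1024 // page_size)                # ceil division
    print("%3d KB value -> %2d pages" % (value_kb, pages_needed))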
after starting up an empty instance and populating it with 8 GB
of data, the redis process uses 16.9 GB of memory and continues
to grow over time -- first quickly, then more and more slowly.
the growth never completely stops though, so there seems to be
some kind of memory leak.
http://code.google.com/p/redis/issues/detail?id=394
c) persistence / durability
because of a), it would be very desirable to make frequent
backups. but the available redis persistence options -- snapshots
and aof -- do not play well with vm.
to create a snapshot (with BGSAVE / SAVE), redis first has to
swap all data back into memory before writing it out to disk
again. with a large dataset, this takes extremely long and soon
becomes infeasible -- especially in combination with b), as the
snapshotting child process consumes even more memory and will
frequently be killed by the os before the snapshot can be
completed.
we assume aof (BGREWRITEAOF) behaves essentially the same,
although we have not tried that.
d) replication
unfortunately, replication soon becomes infeasible as well, as
connecting a new slave forces the master to create a snapshot
internally (see c).
so, to sum this up -- after a while, you are stuck with an
in-memory database that you cannot backup, cannot replicate to
a standby machine, and that will eventually consume all memory
and crash (if it does not crash earlier).
conclusion: redis with vm enabled is pretty much unusable, and we
would really not recommend it to anybody else for production use
at the moment. (at least not as a database, it might work better
as a cache.)
as a workaround, we are going to turn off vm, and let the
application dump / restore individual hashes on-demand to / from
disk -- effectively we are going to write our own virtual memory
implementation in ruby.
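just to make the plan concrete, something along these lines (a
python sketch rather than the ruby we will actually write, with
all names invented):

import os, json, redis     # redis-py; json stands in for whatever serialization we pick

r = redis.Redis()
DUMP_DIR = "/data/hashes"  # hypothetical dump directory

def restore(key):
    # load a hash from disk into redis if it is not already there
    if not r.exists(key):
        path = os.path.join(DUMP_DIR, key)
        if os.path.exists(path):
            with open(path) as f:
                for field, value in json.load(f).items():
                    r.hset(key, field, value)

def dump(key):
    # write a hash out to disk and drop it from redis to free memory
    data = {f.decode(): v.decode() for f, v in r.hgetall(key).items()}
    with open(os.path.join(DUMP_DIR, key), "w") as f:
        json.dump(data, f)
    r.delete(key)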
although the main focus of redis development is on redis cluster
now, this will only partly solve the problem of large datasets.
to make vm a feasible option, at least a better persistence
strategy is urgently needed -- maybe by turning the swap file
into a durable representation of the whole dataset.
cheers
tim
> in practice, we ran into the following problems:
>
>
> a) stability
>
> as reported in issue #395, redis 2.0.4 is very unstable with
> both vm and replication enabled, crashing twice on us after less
> than an hour once we put load on it. we went back to 2.0.1, and
> we considered it rock solid until yesterday, when it unexpectedly
> crashed as well.
>
> http://code.google.com/p/redis/issues/detail?id=395
Hello Tim, in order to get rid of this bug it is fundamental to have
two pieces of information, but since this bug is hard to reproduce,
it's hard for us to make progress without external help.
1) Without replication enabled, is the redis instance stable?
2) What happens when you switch to 2.2-RC1? Is it stable? Even with replication?
Any help on this is really appreciated.
> b) memory consumption
>
> as reported in issue #394 (by brett), redis with vm "greatly
> exceeds vm-max-memory ... [and] continues to grow." our boxes
> have 24 GB of ram (and no swap). we have configured vm as
> follows:
>
> vm-enabled yes
> vm-max-memory 6gb
> vm-page-size 8kb
> vm-pages 50304000
> vm-max-threads 4
Here you probably ran out of pages, or have too many keys, or
something like that; the only way to tell is if you post the INFO
output.
Redis is not able to swap keys in any way, nor to put more than a
single value in a single page.
The page size should probably be more like 64 bytes than 8 kb in your use case.
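Just to give an idea of the tradeoff, a quick sketch (python, with
made-up numbers roughly in the ballpark of your dataset):

# internal fragmentation per swapped value for different page sizes:
# a value of N bytes occupies ceil(N / page_size) pages, so on average it
# wastes up to one page. the flip side of tiny pages is that vm-pages must
# be much larger (redis keeps an in-memory bitmap with one bit per page).
values = 900000         # assumed number of swapped objects
avg_value = 10000       # assumed average serialized value size, in bytes

for page in (64, 4096, 8192):
    pages_per_value = -(-avg_value // page)
    wasted = values * (pages_per_value * page - avg_value)
    print("page %5d B -> %3d pages/value, ~%5d MB wasted overall"
          % (page, pages_per_value, wasted // 2 ** 20))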
> after starting up an empty instance and populating it with 8 GB
> of data, the redis process uses 16.9 GB of memory and continues
> to grow over time -- first quickly, then more and more slowly.
> the growth never completely stops though, so there seems to be
> some kind of memory leak.
I don't think there is any memory leak, it sounds like a different problem.
With the INFO output I can provide more information.
> c) persistence / durability
>
> because of a), it would be very desirable to make frequent
> backups. but the available redis persistence options -- snapshots
> and aof -- do not play well with vm.
>
> to create a snapshot (with BGSAVE / SAVE), redis first has to
> swap all data back into memory before writing it out to disk
> again. with a large dataset, this takes extremely long and soon
> becomes infeasible -- especially in combination with b), as the
> snapshotting child process consumes even more memory and will
> frequently be killed by the os before the snapshot can be
> completed.
This problem is partially addressed in 2.2 (the child will not use too
much additional memory), but to fix this in a proper way what's needed
is to change the VM implementation so that the data is prefixed by its
length.
That way, in order to save a swapped-out value all there is to do is a
read and a write, without serialization / deserialization of what is
stored inside.
This is a change I want to make. There are also a few changes that can
make swapping out values almost twice as fast, which should be added as
well.
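The shape of the idea, as a sketch (python, hypothetical on-disk
format, not the real code):

# if every swapped value is stored as <4-byte length><already-serialized payload>,
# saving it during BGSAVE becomes one read plus one write, with no need to load
# the object back into memory and re-serialize it.
import struct

def copy_swapped_value(swap_file, offset, rdb_file):
    swap_file.seek(offset)
    (length,) = struct.unpack(">I", swap_file.read(4))   # length prefix
    rdb_file.write(swap_file.read(length))               # raw copy, no decode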
> we assume aof (BGREWRITEAOF) behaves essentially the same,
> although we have not tried that.
Yes it is exactly the same as BGSAVE.
> d) replication
>
> unfortunately, replication soon becomes infeasible as well, as
> connecting a new slave forces the master to create a snapshot
> internally (see c).
>
>
> so, to sum this up -- after a while, you are stuck with an
> in-memory database that you cannot backup, cannot replicate to
> a standby machine, and that will eventually consume all memory
> and crash (if it does not crash earlier).
>
> conclusion: redis with vm enabled is pretty much unusable, and we
> would really not recommend it to anybody else for production use
> at the moment. (at least not as a database, it might work better
> as a cache.)
>
>
> as a workaround, we are going to turn off vm, and let the
> application dump / restore individual hashes on-demand to / from
> disk -- effectively we are going to write our own virtual memory
> implementation in ruby.
The 2.2 VM is almost a rewrite; I think it's worth checking how 2.2
behaves ASAP, as the 2.2 VM is probably more stable / functional than
2.0.
Also you can likely turn the 2.0 vm into a more stable (and sometimes
much faster) one with this config line:
vm-max-threads 0
If you can confirm this it would be much appreciated.
This will also be an interesting hint about where the bug is.
> although the main focus of redis development is on redis cluster
> now, this will only partly solve the problem of large datasets.
> to make vm a feasible option, at least a better persistence
> strategy is urgently needed -- maybe by turning the swap file
> into a durable representation of the whole dataset.
This is a problem (turning the swap file into a durable
representation) as it requires a major rewrite and turns the design
the other way around (what is often used stays in memory, but
everything is anyway flushed to the swap file from time to time).
Currently what I want to do to address these problems is:
1) Check if 2.2 behaves much better, as we think it does.
2) Prefix pages in the VM with the real object length, so we can make
persistence much faster.
3) Possibly mmap() everything for performance.
4) Check how Redis cluster performs, and what the user reaction will be.
5) Even think about removing VM and killing the feature entirely... if
needed, for the 3.0 release.
I'll work at "2" and other VM enhancements starting from January, but
we also need help from users.
One of the problem of VM is that a very smallish amount of people is
using it, and this brings us to solution "5"...
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele
> we are deploying a very big implementation of redis with vm to production next week.
uh, good luck :)
tim
Thank you, make sure to use a good tradeoff between the number of pages
and the page size. It's a good idea to avoid pages that are too big; try
64 bytes, for instance.
Please save any logs in case of stability problems, and also check
whether the problems go away when you disconnect the slave, or
alternatively with vm-max-threads 0.
I'll be very happy to provide support directly during the deployment
or if you run into problems.
Thanks!
Salvatore
>> a) stability
> 1) Without replication enabled, is the redis instance stable?
well, i guess we will find out over christmas -- we are running
without a standby slave now.
> 2) What happens when you switch to 2.2-RC1? Is it stable? Even with replication?
if it should crash again, i might try 2.2-RC1 instead, but without
replication.
>> b) memory consumption
> Here you probably ran out of pages, or have too many keys, or
> something like that; the only way to tell is if you post the INFO
> output.
vm_conf_max_memory:6442450944
vm_conf_page_size:8192
vm_conf_pages:50304000
vm_stats_used_pages:1110340
vm_stats_swapped_objects:854020
vm_stats_swappin_count:192009
vm_stats_swappout_count:1046029
vm_stats_io_newjobs_len:0
vm_stats_io_processing_len:0
vm_stats_io_processed_len:0
vm_stats_io_active_threads:0
vm_stats_blocked_clients:0
db2:keys=917917,expires=0
db3:keys=13,expires=0
> Also you can likely turn the 2.0 vm into a more stable (and sometimes
> much faster) one with this config line:
>
> vm-max-threads 0
>
> If you can confirm this it would be much appreciated.
hmmm, i have been toying with the idea, but i am afraid of blocking
other clients -- although we have really fast disks in raid 10.
>> to make vm a feasible option, at least a better persistence
>> strategy is urgently needed -- maybe by turning the swap file
>> into a durable representation of the whole dataset.
>
> This is a problem (turning the swap file into a durable
> representation) as it requires a major rewrite and turns the design
> the other way around (what is often used stays in memory, but
> everything is anyway flushed to the swap file from time to time).
it is the only viable strategy that i can think of, at least
for really large datasets.
> Currently what I want to do to address these problems is:
>
> 1) Check if 2.2 behaves much better, as we think it does.
> 2) Prefix pages in the VM with the real object length, so we can make
> persistence much faster.
> 3) Possibly mmap() everything for performance.
> 4) Check how Redis cluster performs, and what the user reaction will be.
> 5) Even think about removing VM and killing the feature entirely... if
> needed, for the 3.0 release.
>
> I'll work on "2" and other VM enhancements starting in January, but
> we also need help from users.
> One of the problems with VM is that a very small number of people are
> using it, and this brings us to solution "5"...
yeah, i understand, and i am afraid my rant might drive the number
even further down ;)
On Thu, Dec 23, 2010 at 4:42 PM, Jonathan Leibiusky <iona...@gmail.com> wrote:
> we are deploying a very big implementation of redis with vm to production
> next week.
> 4 servers with 16 GB RAM, 12 allocated to redis. we are sharding 1 TB of
> data between the 4 servers, so every redis server will handle 250 GB of
> data. to be able to do that we configured VM and AOF. oh, and the redis
> version is the latest 2.2 one.
> will let you know how redis behaves in the next few weeks.
> and if you have any suggestions just let me know, i will be happy to play
> with the conf and give some feedback.
Thank you, make sure to use a good tradeoff between the number of pages
and the page size. It's a good idea to avoid pages that are too big; try
64 bytes, for instance.
Please save any logs in case of stability problems, and also check
whether the problems go away when you disconnect the slave, or
alternatively with vm-max-threads 0.
I'll be very happy to provide support directly during the deployment
or if you run into problems.
Thanks, this is useful.
>> 2) What happens when you switch to 2.2-RC1? Is it stable? Even with replication?
>
> if it should crash again, i might try 2.2-RC1 instead, but without
> replication.
As a bonus, the VM of 2.2 is a major rewrite with the good side effect
that objects are not "fat objects" but just normal Redis objects, like
when running without VM enabled.
So it should use a lot less memory.
>>> b) memory consumption
>
>> Here you probably ran out of pages, or have too many keys, or
>> something like that; the only way to tell is if you post the INFO
>> output.
>
> vm_conf_max_memory:6442450944
> vm_conf_page_size:8192
> vm_conf_pages:50304000
> vm_stats_used_pages:1110340
> vm_stats_swapped_objects:854020
> vm_stats_swappin_count:192009
> vm_stats_swappout_count:1046029
> vm_stats_io_newjobs_len:0
> vm_stats_io_processing_len:0
> vm_stats_io_processed_len:0
> vm_stats_io_active_threads:0
> vm_stats_blocked_clients:0
> db2:keys=917917,expires=0
> db3:keys=13,expires=0
It seems like everything is already swapped, since total keys ->
917917, swapped -> 854020. Possibly the non-swapped part consists of
numbers that are shared objects, so they will never get swapped, or
something like that. Hard to tell without knowing the exact nature of
the data set.
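In numbers (a quick sketch using the values from your INFO):

used_pages = 1110340
page_size = 8192
swapped_objects = 854020
keys_db2 = 917917

print("swap space used : %.1f GB" % (used_pages * page_size / 1024.0 ** 3))   # ~8.5 GB
print("swapped fraction: %.0f%%" % (100.0 * swapped_objects / keys_db2))      # ~93%
print("avg pages/object: %.2f" % (float(used_pages) / swapped_objects))       # ~1.3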
>> Also you can likely turn the 2.0 vm into a more stable (and sometimes
>> much faster) one with this config line:
>>
>> vm-max-threads 0
>>
>> If you can confirm this it would be much appreciated.
>
> hmmm, i have been toying with the idea, but i am afraid of blocking
> other clients -- although we have really fast disks in raid 10.
Chances are that the VM of 2.0 will be much faster with this
setting... it is blocking but it is much more efficient since there is
no message passing between different threads, clients are not
suspended, and so forth.
>> This is a problem (turning the swap file into a durable
>> representation) as it requires a major rewrite and turns the design
>> the other way around (what is often used stays in memory, but
>> everything is anyway flushed to the swap file from time to time).
>
> it is the only viable strategy that i can think of, at least
> for really large datasets.
Probably yes, but it's important to realize that the price to pay is
no "point in time" dump. You have a representation of all the
keys/values on disk, and a copy of the most used things in memory (big
win -> vm-max-memory can always be honored). This is how it could
work:
1 - Every time we read from a key that is already in memory, nothing
special, we read...
2 - Every time we write to a key that is already in memory, we write,
and we put this key into a "need-to-flush-on-disk" queue or something
like that. Possibly we wait some seconds before flushing (configurable),
so N writes will result in a single write to disk.
3 - Every time we read/write a key that is not in memory, we load it
into memory (possibly removing some other key from memory using LRU),
and then we do 1) or 2) accordingly.
So persistence is per-key in this design. Not a big deal. For sure a
design much better than the current one. Also it is interesting that
this design will bring the DB online one second after typing
./redis-server.
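Just to make it concrete, a sketch of the behaviour (python,
everything simplified and every name invented -- the real thing would
of course live inside the server, not in a client):

import time
from collections import OrderedDict

class DiskBackedCache:
    """keys live on disk; the most used ones are cached in memory,
    writes go into a flush queue and hit the disk after a delay."""

    def __init__(self, store, max_keys=10000, flush_delay=5.0):
        self.store = store            # object with load(key) / save(key, value)
        self.cache = OrderedDict()    # key -> value, kept in LRU order
        self.dirty = {}               # key -> time of first unflushed write
        self.max_keys = max_keys
        self.flush_delay = flush_delay

    def get(self, key):
        if key not in self.cache:                 # rule 3: load on miss
            self._make_room()
            self.cache[key] = self.store.load(key)   # load() returns None if absent
        self.cache.move_to_end(key)               # rule 1: plain read, refresh LRU
        return self.cache[key]

    def set(self, key, value):
        self.get(key)                             # make sure it is resident
        self.cache[key] = value
        self.dirty.setdefault(key, time.time())   # rule 2: queue for flushing

    def flush(self):
        now = time.time()
        for key, since in list(self.dirty.items()):
            if now - since >= self.flush_delay:   # N writes -> one disk write
                self.store.save(key, self.cache.get(key))
                del self.dirty[key]

    def _make_room(self):
        while len(self.cache) >= self.max_keys:
            old, value = self.cache.popitem(last=False)   # evict the LRU key
            if old in self.dirty:                         # don't lose dirty data
                self.store.save(old, value)
                del self.dirty[old]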
>> Currently what I want to do to address these problems is:
> yeah, i understand, and i am afraid my rant might drive the number
> even further down ;)
Unfortunately your rant is more than justified: while VM works in
some scenarios, it is hardly a good implementation currently.
We need to understand whether it's the way to go, whether it's worth
improving (for instance with the design above), or whether it's better
to kill it, and so forth.
Auto-replying just to add: it is important to understand one side
effect of this design, that is, every access to a non-existing key
will cost us a disk access.
This is not too bad, as most applications rarely access non-existing
keys except to create them.
There is also the trick of negative caching, that is, reserving
some memory or using a good heuristic to keep in memory the
information that a given key is not on disk (a bloom filter can help
here).
But there are very good things about this solution, and a good part
can be implemented on top of our current VM layer.
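For instance (a sketch, python): here the filter remembers the keys
that DO exist on disk, so a negative answer is always safe and only
the occasional false positive costs an extra disk access.

import hashlib

class ExistsFilter:
    """bloom filter over the keys present on disk."""

    def __init__(self, bits=8 * 1024 * 1024, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, key):
        digest = hashlib.sha1(key.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def add(self, key):                     # call whenever a key is written to disk
        for pos in self._positions(key):
            self.bitmap[pos // 8] |= 1 << (pos % 8)

    def may_exist(self, key):               # False -> key is certainly not on disk
        return all(self.bitmap[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))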
>> vm_stats_swapped_objects:854020
>> vm_stats_swappin_count:192009
>> vm_stats_swappout_count:1046029
>> db2:keys=917917,expires=0
>> db3:keys=13,expires=0
>
> It seems like everything is already swapped, since total keys ->
> 917917, swapped -> 854020. Possibly the non-swapped part consists of
> numbers that are shared objects, so they will never get swapped, or
> something like that. Hard to tell without knowing the exact nature of
> the data set.
the values in db3 are shared data structures that are accessed
all the time. db2 contains only hashes, the bulk of the dataset.
each hash represents a user and is only needed when the user
is online, so these can be swapped out most of the time.
>>> This is a problem (turning the swap file into a durable
>>> representation) as it requires a major rewrite and turns the design
>>> the other way around (what is often used stays in memory, but
>>> everything is anyway flushed to the swap file from time to time).
>>
>> it is the only viable strategy that i can think of, at least
>> for really large datasets.
>
> Probably yes, but it's important to realize that the price to pay is
> no "point in time" dump.
that is a price i would be very willing to pay -- it would mean
that we could never lose more than x minutes of updates per user,
which is good enough (at least for our use case, social games).
> You have a representation of all the
> keys/values on disk, and a copy of the most used things in memory (big
> win -> vm-max-memory can always be honored).
yeah, having control over memory consumed would be awesome. :)
> This is how it could work:
>
> 1 - Every time we read from a key that is already in memory, nothing
> special, we read...
> 2 - Every time we write to a key that is already in memory, we write,
> and we put this key into a "need-to-flush-on-disk" queue or something
> like that. Possibly we wait some seconds before flushing (configurable),
> so N writes will result in a single write to disk.
this part we have already implemented in our application, and it
works really well. we flush hashes to disk 10 minutes after they
are first modified, and every 10 minutes again as long as there
are new modifications.
we chose 10 minutes because user sessions only last 5 minutes on
average, so we save a lot of disk io this way.
the flushing strategy (every x minutes, after x modifications ...)
would be a very important knob in the new implementation, i think.
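in pseudo-code it is roughly this (python sketch -- the real code is
ruby, and the writer function is whatever actually persists the hash):

import time

FLUSH_AFTER = 10 * 60           # flush a hash 10 minutes after its first change
first_dirty = {}                # key -> timestamp of the first unflushed change

def mark_modified(key, now=None):
    # remember only the *first* modification; later ones ride along for free
    first_dirty.setdefault(key, now or time.time())

def flush_due(write_to_disk, now=None):
    # write_to_disk(key) is whatever persists the hash (yaml file, etc.)
    now = now or time.time()
    for key, since in list(first_dirty.items()):
        if now - since >= FLUSH_AFTER:
            write_to_disk(key)
            del first_dirty[key]    # the next write starts a new 10 minute window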
> 3 - Every time we read/write a key that is not in memory, we load it
> into memory (possibly removing some other key from memory using LRU),
> and then we do 1) or 2) accordingly.
>
> So persistence is per-key in this design.
exactly. the last part we still have to get working -- we plan to use
key expiration -- and then we can switch off vm.
> For sure a
> design much better than the current one. Also it is interesting that
> this design will bring the DB online one second after typing
> ./redis-server.
yeah, i am very much looking forward to that as well. :)
This is actually no longer VM, it is a different persistence engine,
as it makes sense to use it even when RAM is not a problem, in order
to obtain fast restarts, no forking of a child, and so forth, in
exchange for performance if the application is very write intensive,
or alternatively in exchange for durability (if there are many writes,
even if the whole database fits in RAM, we need to flush these writes
to disk ASAP, consuming CPU and possibly not being able to flush data
to disk as fast as configured).
There is a big open problem: replication. The current replication
design assumes we are able to transfer a point-in-time dump, but with
such a persistence engine this is no longer possible. So we need to
invent something new... but I doubt there are good solutions to this
problem.
Cheers,
Salvatore
I just found a solution to this (it's not different from the current
VM itself: just suspend writes to have a stable dump on disk, fork,
and do a scan to produce the .rdb, or even just copy the file and make
the slave able to understand this format. Either way it's mostly as
fast as copying a file, as we'll make sure our btree dump values use
the same serialization as the .rdb).
And there are tons of other optimizations we can implement. For
instance when storing ziplists or zipmaps and so forth, just store
them as blobs as they are already blobs. If you are into Redis
internals you should know what an improvement this alone can be.
Btw I'll produce a detailed writeup on my blog about all these future
directions. We have known for a long time that the VM implementation
is not optimal at all, but the direction was to make cluster cool and
eventually drop this support. Maybe we have better options, and at
least we should try what seems to be a much saner design.
I'll post a message here when the blog entry is up on antirez.com
Cheers,
another suggestion: we store each value in a separate file, using
a two-level directory tree like bigdis. this works really well and
makes the on-disk representation less opaque than one big file.
one concrete advantage is that we can do incremental backups using
rsync -- something like poor man's snapshots (as unfortunately we
don't have ZFS).
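the key-to-path mapping is trivial, something like this (python
sketch; the exact hashing scheme is just an example, not necessarily
what bigdis does):

import hashlib, os

ROOT = "/data/users"                      # hypothetical root directory

def path_for(key):
    # two levels of two-digit decimal directories, 00-99 each,
    # so 10000 leaf directories in total
    n = int(hashlib.md5(key.encode()).hexdigest(), 16)
    d1, d2 = "%02d" % (n % 100), "%02d" % (n // 100 % 100)
    return os.path.join(ROOT, d1, d2, key)

def store(key, blob):
    path = path_for(key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(blob)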
> Btw I'll produce a detailed writeup on my blog about all these future
> directions.
good, very much looking forward to that!
> another suggestion: we store each value in a separate file, using
> a two-level directory tree like bigdis. this works really well and
> makes the on-disk representation less opaque than one big file.
> one concrete advantage is that we can do incremental backups using
> rsync -- something like poor man's snapshots (as unfortunately we
> don't have ZFS).
plus, we don't have to pre-allocate a fixed size swap file in advance,
and thus we avoid the headache of deciding on the "right" number and
size of swap pages, and the size of the dataset is not artificially
limited in any way.
tim
What you are doing in this way is using the filesystem as a btree
implementation. If it is a good implementation for holding tons of
small files, it's going to work well; otherwise, it's not going to
work very well :)
I used this in bigdis, as the bigdis prototype was designed for big
values. In this context the overhead of the filesystem is not going to
be so big.
But for millions or billions of small keys -> values, is this going to
work well? On what file systems?
How much performance / efficiency do we lose compared to a real btree
in a single file?
We need good answers to these questions in order to pick the best option.
Cheers,
Salvatore
> I ran redis-stat (https://github.com/antirez/redis-tools) on a smaller
> version of the same data and it told me that the best size for the page
> is 4096 bytes. Do you still think it is better to leave it at 64 bytes?
>
> Oh, and using AOF and having 16 GB RAM, giving redis 12 GB, do you think
> it is OK, or in order to do BGREWRITEAOF once a day (during low traffic
> hours) is it better to have more free RAM for the child process that
> regenerates the aof file?
redis-stat is really a prototype, better to use 64 bytes as the page size ;)
AOF is perfectly fine, but how often to BGREWRITEAOF depends on the
amount of writes you have.
What I suggest is to check how much it grows every minute on average,
so that you make sure to rewrite it when it gets too big compared to
how big it is just after a BGREWRITEAOF completes successfully.
For instance, if you have mostly reads, you can rewrite just once every
24 hours.
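A trivial cron-style check could look like this (a sketch, python
with redis-py; the path and the threshold are just examples):

import os, redis                     # assumes redis-py and local access to the AOF

AOF_PATH = "/var/lib/redis/appendonly.aof"   # hypothetical path
GROWTH_FACTOR = 2                            # rewrite when the file has doubled

r = redis.Redis()
baseline = os.path.getsize(AOF_PATH)         # size right after the last rewrite

def maybe_rewrite():
    global baseline
    size = os.path.getsize(AOF_PATH)
    if size > baseline * GROWTH_FACTOR:
        r.bgrewriteaof()                     # rewrite happens in the background
        baseline = size                      # rough; re-measure once it completes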
Jeremy, I share your vision in many ways.
In a sentence I could say that while Redis is "one of the many" from
the point of view of disk persistence, it is pretty unique as an
in-memory DB and in the operations that are efficient because it is in
memory (like sorted sets).
I want to improve persistence and VM (which is just going to become a
different persistence engine, no longer a real virtual memory, if we
implement the planned changes), as they are important for the single
node, whether that node is part of a cluster or just a single
instance, but cluster will be the major focus of 2011 for sure.
Possibly working only on Redis cluster is not a good idea, so we'll
try to carry on this other work at the same time. I hope that by the
end of January I can publish the cluster code in its alpha stage, but
with the main ideas already in place, so that Pieter and I can start
working together on it. I'm not doing this now just because I think
it's very hard to collaborate on a design that is too much of a work
in progress, and that the initial design of everything is best done by
a single person (not just in computer science, in almost everything).
this approach certainly *does* scale to millions of keys -- we
currently use two-digit decimal directories, i.e. 00 to 99, on
two levels, so 10000 "leaf" directories in total. given a million
evenly distributed keys, this means around 100 values per dir.
two levels of two-digit hex directories would result in 65536 leaf
directories, or around 65 million keys with roughly 1000 keys per dir.
making it scale to billions of keys might be tricky though ;)
tim
> making it scale to billions of keys might be tricky though ;)
Yes, there is also a performance issue... and btree.c/h inside sqlite
are very tempting ;)
Cheers,
Salvatore
ubuntu@peking:/data/users$ time find . | wc -l
943431
real 0m0.634s
user 0m0.260s
sys 0m0.400s
but yeah, i see your point :)
cheers
tim
i still think there is a memory leak (issue #394).
here is some data (output of 'ps aux') that i collected over
the last week. it shows the growing memory consumption of our
redis process (no slaves connected, no persistence, vm enabled,
vm-max-memory 6gb):
USER PID %CPU %MEM VSZ RSS
Mon Dec 27 07:27:37 CET 2010
redis 24044 20.7 76.2 19091212 18871980
Wed Dec 29 20:55:34 CET 2010
redis 24044 23.8 78.3 19617544 19379564
Thu Dec 30 14:15:00 CET 2010
redis 24044 23.9 78.3 19619592 19381604
Fri Dec 31 12:41:20 CET 2010
redis 24044 24.4 78.9 19754760 19539692
Sat Jan 1 20:55:54 CET 2011
redis 24044 24.4 79.5 19911428 19674032
Sun Jan 2 23:08:13 CET 2011
redis 24044 25.1 79.7 19969796 19735380
redis is doing between 2000 and 8000 operations per second,
and we seem to be losing almost a gig of memory per week.
(info has constantly been reporting 'used_memory_human:6.00G'
the whole time.)
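(for the curious: a small python sketch of how we could collect this
kind of data, comparing the process rss against redis' own
used_memory -- the ratio between the two is basically the
fragmentation. pid and interval are just examples.)

import subprocess, time, redis     # redis-py

PID = 24044                        # redis process id (example)
r = redis.Redis()

while True:
    rss_kb = int(subprocess.check_output(
        ["ps", "-o", "rss=", "-p", str(PID)]).decode().strip())
    used = int(r.info()["used_memory"])
    print("%s rss=%.2f GB used_memory=%.2f GB ratio=%.2f" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        rss_kb * 1024 / 1024.0 ** 3, used / 1024.0 ** 3,
        rss_kb * 1024.0 / used))
    time.sleep(3600)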
cheers
tim
On 2010-12-23, at 16:24 , Salvatore Sanfilippo wrote:
> On Thu, Dec 23, 2010 at 4:07 PM, Tim Lossen <t...@lossen.de> wrote:
>> after starting up an empty instance and populating it with 8 GB
>> of data, the redis process uses 16.9 GB of memory and continues
>> to grow over time -- first quickly, then more and more slowly.
>> the growth never completely stops though, so there seems to be
>> some kind of memory leak.
>
> I don't think there is any memory leak, it sounds like a different problem.
Tim, Pieter,
indeed it is impossible that it's a memory leak since the memory
reported by Redis is always 6GB.
Redis traps every allocation using a wrapper (zmalloc.c), so a memory
leak would have the effect of the memory reported by Redis increasing
over time.
I think this is definitely fragmentation, and is probably related to
the use of hashes. My guess is that in datasets with a lot of
specially encoded hashes that all, in general, grow monotonically in
size, old allocations are almost always discarded for bigger ones.
When hashes are not integer encoded this is not a problem, since the
single objects are always the same size, but here it is different. The
fragmentation will stop at some point; it should not be a never-ending
process.
The solution to this problem is a slab allocator, but it may have the
effect of using much more ram at first, as memory is allocated in size
classes that are usually powers of two or similar.
Another problem you are experiencing is that 2.0 used a larger Redis
object when VM was enabled.
I strongly suggest upgrading to 2.2.
Later today I'll send you a small ruby script you can run against your
instance to sample some information about your dataset (it's something
safe to run in production), but please, if you can, upgrade to 2.2 asap.
Before you upgrade to 2.2, later today I'll add a CONFIG SET option so
that you can change vm-max-memory dynamically at runtime. That way you
can start with less memory and use more and more as you see the
fragmentation stabilize. But anyway, with 2.2 you are going to
make a much better use of your memory, because:
1) LRU for eviction in VM.
2) Smaller objects.
That said, now that there is no replica, is my understanding correct
that Redis 2.0.x is no longer crashing with VM enabled? This would
confirm our guesses that the problem is replication + VM, so we can
create a testing environment to reproduce the problem and fix it, and
check if it also happens in 2.2 (I'm optimistic about it since it's
almost a complete rewrite).
About the future of "big data Redis" I'm working hard at diskstore,
the idea we discussed in this thread.
It's already working and I'll send a detailed email about it today in
the mailing list, I really need help in evaluating how well it works,
but the first results are really encouraging.
thanks for looking into this matter.
> indeed it is impossible that it's a memory leak since the memory
> reported by Redis is always 6GB.
> I think this is definitely fragmentation, and is probably related to
> the use of hashes.
yes, very likely. in fact, we moved all non-hash values to a
different redis instance yesterday, and the RSS is still increasing.
> Later today I'll send you a small ruby script you can run against your
> instance to sample some information about your dataset (it's something
> safe to run in production),
ok, i am happy to help to nail down this problem, if possible.
> but please, if you can, upgrade to 2.2 asap.
hmmmm .... we are not very keen on that. this is a production system,
and every restart of redis involves around 30 minutes of downtime, as
we have to load all data into the new instance. if redis 2.2 should not
perform as expected, we would incur yet another downtime to switch back.
> with 2.2 you are going to
> make a much better use of your memory, because:
>
> 1) LRU for eviction in VM.
> 2) Smaller objects.
interesting -- what is the eviction strategy in 2.0, if not LRU?
> That said, now that there is no replica, is my understanding correct
> that Redis 2.0.x is no longer crashing with VM enabled?
yes, we are running without replication now. hard to say if this has
improved stability, though: the last time (= with replication) redis
crashed after 20 days, current uptime is 12 days.
> About the future of "big data Redis" I'm working hard at diskstore,
> the idea we discussed in this thread.
> It's already working and I'll send a detailed email about it today in
> the mailing list
cool, very much looking forward to that!
we have been working on our own application-specific "restore-on-demand"
implementation as well, and once this is stable, we plan to finally turn
off vm in a few days.
Hello Marc,
you are right, it is important to consider what was wrong with VM.
But I don't think it was the granularity.
Most objects in Redis are small, especially when they are pieces of
bigger aggregate data types.
So even just keeping pointers to the data on disk is not only hard,
but also mostly useless, as the pointers alone take a comparable
amount of memory.
Another reason is that with the specially encoded data types we have
now (for hashes, lists, and sets under a given size) it makes a lot of
sense to write them as a single blob on disk, as they are already
stored this way in memory.
Under all these assumptions, there is *no* good solution for handling
large aggregate data types on disk, other than:
1) Change the application so that, if possible, smaller aggregated types are used.
2) Implement the data structures we support *on disk*.
or simply... if you have a big sorted set, use Redis in the native
in-memory way.
Btw approach '1' can be done by the guy writing the application, so
that diskstore becomes viable.
Approach '2' is a nightmare, as all the atomic stuff we do is fast
and trivial only because it is done in RAM.
So that said, I think that neither having an on-disk representation
of what we have in ram, in a way that is hackable directly on disk,
nor having a finer granularity, is going to help, and that was not the
problem with VM.
So what was the problem with VM? :)
1) It was not able to swap keys. This was a huge mistake. The typical
application of Redis diskstore will be to have millions of objects
with a small hash inside. If you can't swap keys, the keys alone are
already too much data in RAM for a small instance. I think that
swapping sub-values would indeed repeat this error.
2) We had most of the data on disk, but this stored representation was
a throw-away business, not useful for persistence. So we ended up
having the same data on disk in the swap file, in memory, and in the
.rdb for persistence. Far from ideal.
3) It did not fit the "COW style" persistence model of Redis. In
order to save the dataset you basically had to access the whole swap
file, value after value, and it was incredibly slow.
4) It was too similar to in-memory from the point of view of
tradeoffs. Same consistency, same persistence, same slow start (but
much slower, since there was swapping involved). The new system is
really an alternative, so the use cases are pretty distinct; it's not
very hard to tell when in-memory can work, when diskstore can work, or
neither (or both).
5) The implementation was too complex. This was definitely fixable, as
the implementation of diskstore was theoretically applicable to the
old VM code as well. Proof: diskstore worked fine after 5 days of
work. To reach the same level of stability, VM took 1 month of work.
To get diskstore better I think we need people applying it in the real
world ASAP.
And for this we need to ship a stable enough implementation ASAP. But
we are not too far ;)
USER PID %CPU %MEM VSZ RSS
Tue Jan 4 12:10:06 CET 2011
redis 24044 24.6 80.3 20061956 19880124
Wed Jan 5 18:06:25 CET 2011
redis 24044 24.3 80.3 20061956 19871592
Thu Jan 6 10:24:08 CET 2011
redis 24044 24.0 80.3 20061956 19871592
Fri Jan 7 15:27:28 CET 2011
redis 24044 23.6 80.3 20061956 19871584
the most likely culprit for memory fragmentation seems to be
sorted sets now. for example, we use one sorted set to keep
track of currently active users, and it is continuously being
updated (ZADD / ZREMRANGEBYSCORE). this is one of the values
we moved to a separate redis instance (without vm).
tim
On 2011-01-04, at 11:46 , Tim Lossen wrote:
>> indeed it is impossible that it's a memory leak since the memory
>> reported by Redis is always 6GB.
>
>> I think this is definitely fragmentation, and is probably related to
>> the use of hashes.
>
> yes, very likely. in fact, we moved all non-hash values to a
> different redis instance yesterday, and the RSS is still increasing.
> Let me play dumb for a second here: How can I, the guy writing the
> application, use smaller aggregated types? If I currently have a zset which
> has 1M elements, and I want to set an upper limit of some optimal size, say
> 1000, how do I segment this (and still be able to do zinter, zunion, etc.?)
In that case you simply don't want to use diskstore but memory.
Let's put it another way:
the use case for in-memory redis is everything supported by the data
model. But "I don't have that much memory, my dataset is too large!"
Ok, then you can use diskstore, but it will work well for a subset of
use cases that are, guess what, very similar to the use cases you can
model with an on-disk store. That is mainly plain key -> value
business, or anything where the value is an aggregate data type that
is not too large.
This means a lot of use cases: representing objects with small hashes,
short lists for capped timelines, small sets linking resources to a
list of tags, and so forth.
Your proposal of swapping single values does not solve this, for a
number of reasons; the only alternative would be to directly model our
types (lists, sets, hashes, ...) on disk, with on-disk data structures
that can directly support this kind of operation. But this is not
Redis business, we can't make all the users happy.
On the other hand, there are a few cases where a problem appears to
need large lists, sets, sorted sets, and so forth, but you can instead
model it using small data structures. In all these cases you can turn
a problem that has good solutions only with the in-memory backend into
one with good solutions with diskstore.
Example: long timeline? Segment it into N keys, where each one is a
list with at most 500 elements. Since timelines usually have an access
pattern where most of the time you only need to retrieve the latest
entries, this will work very well, and if somebody keeps hitting "read
more" you'll retrieve data from the next keys.
Another example: a big set used to check if something exists in a
given category? Instead of using a set, use plain keys with prefixes.
That way every key can move in and out of the diskstore cache
independently.
And so forth.
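The timeline example in code (a sketch, python with redis-py, key
names invented):

import redis   # redis-py

r = redis.Redis()
SEG = 500                                    # max elements per list segment

def timeline_add(user, entry):
    n = r.incr("tl:%s:count" % user)         # total number of entries so far
    seg = (n - 1) // SEG                     # which segment this entry lands in
    r.rpush("tl:%s:%d" % (user, seg), entry)

def timeline_latest(user, count=20):
    n = int(r.get("tl:%s:count" % user) or 0)
    out, seg = [], (n - 1) // SEG
    while seg >= 0 and len(out) < count:     # walk segments from newest to oldest
        chunk = r.lrange("tl:%s:%d" % (user, seg), 0, -1)
        out.extend(reversed(chunk))          # rpush stores oldest first
        seg -= 1
    return out[:count]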
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
Hello Sam, thank you for your email,
> We were running Redis 2.0 when all this happened, but as I understand
> it sorted sets didn't get much optimisation between 2.0 and 2.2?
Sorted sets use 20% less memory for sure in 2.2 -- any sorted set, I
mean. They also perform fewer allocations, so it is likely that they
fragment memory less.
Btw what I think is that this problem is evident with sorted sets
since they allocate memory in a special way, that is, sorted set nodes
are not all the same length: there are nodes with just one link and
nodes with up to 15 links. So I think that what happens is the
following:
when you free a sorted set due to VM (but this will happen similarly
if you have a workload that deletes sorted sets often), you end up
with free spaces of different sizes. Then Redis allocates more small
objects that from time to time will use space that was bigger, so the
next time there is a need to allocate a, let's say, 15-link node, the
old space will be hard to reclaim as many of these pieces are now
fragmented, and new memory will be used. And so forth, forever...
> Do you think the new diskstore branch is likely to get around this
> behaviour - perhaps because of the simpler representation of values on
> disk?
The problem described here is trivially solvable using a slab
allocator that can be implemented just touching zmalloc.c, so we can
implement this directly in 2.2 before shipping, as an optional
configuration flag (possibly set to yes by default, not sure). Why
optional? Because with work loads where there is no big risk of
fragmentation the usual allocator will use less memory. Slab
allocators allocate memory at segmented sizes (for instance power of
two) up to a given size, so it's a tradeoff between space used and
fragmentation.
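The size-class idea in a nutshell (a sketch, python; the classes and
the cutoff are arbitrary):

# round every small allocation up to a power-of-two size class, so freed chunks
# can be reused by later allocations of the same class instead of fragmenting.
def size_class(n, cutoff=4096):
    if n > cutoff:
        return n            # big allocations keep their exact size (own mapping)
    c = 64                  # smallest class
    while c < n:
        c *= 2
    return c

for n in (30, 90, 700, 5000):
    print(n, "->", size_class(n))   # 30 -> 64, 90 -> 128, 700 -> 1024, 5000 -> 5000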
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
It would help to know whether they get bigger and bigger, how many
sorted sets there are, what the workload is, and so forth.
Thanks!
Salvatore
About that: is anybody listening here experiencing fragmentation
without VM?
Thanks,
Salvatore
On Jan 8, 2011 2:45 AM, "Salvatore Sanfilippo" <ant...@gmail.com> wrote:
> The problem described here is trivially solvable using a slab
> allocator that can be implemented just touching zmalloc.c, so we can
> implement this directly in 2.2 before shipping, as an optional
> configuration flag (possibly set to yes by default, not sure). Why
> optional? Because with work loads where there is no big risk of
> fragmentation the usual allocator will use less memory. Slab
> allocators allocate memory at segmented sizes (for instance power of
> two) up to a given size, so it's a tradeoff between space used and
> fragmentation.
Could you not simply round sorted-set allocations up to get a smaller set of size classes? That would probably reduce zset fragmentation without affecting all other allocations.
Mike
Hello Mike, this is not possible since the space difference from the
smallest to the largest zset node is too big.
Also, the same can happen with other allocation patterns not involving
zsets (for instance setting strings of increasing size via APPEND may
lead to similar problems).
Cheers,
Salvatore
For most allocators, once things are beyond a certain size they get a
dedicated mapping, meaning that they don't contribute to
fragmentation. Everything below that can be rounded up into a bucket
(power of two is popular, but there are probably other choices)
meaning that you get the reuse you want for avoiding fragmentation.
Another option is copying compaction, since the set of references to a
given object is pretty small and easy to find, but that would require
a more invasive change.
On the topic of allocators, there's a fantastic post by Jason Evans
about the work he did for fragmentation and performance improvements
on jemalloc, for Facebook's use:
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
jemalloc might be a good allocator choice in the future; we're using
it for Firefox, and we're very sensitive to both allocation speed and
fragmentation.
Mike
redis_version:2.0.1
uptime_in_days:36
changes_since_last_save:3139419007
vm_stats_swappin_count:16858675
vm_stats_swappout_count:18623203
db2:keys=1807395,expires=0
3 billion operations, without any hiccup so far. memory usage
is now completely stable as well:
USER PID %CPU %MEM VSZ RSS
Tue Jan 11 09:51:50 CET 2011
redis 24044 23.0 80.3 20061956 19871576
Fri Jan 28 16:30:44 CET 2011
redis 24044 24.8 80.3 20061956 19871516
so if you can live without persistence (i.e. for a large cache),
vm might still be an interesting option.
tim
there is clearly some problem with VM + replication in 2.0 that can
lead to a crash.
I think the problem should not be present in 2.2.
And there is, with 2.2 as well, the problem that persistence is
simply too slow...
Another major problem remains the fact that it is not an option when
the problem is having a lot of keys, as keys can't be swapped.
Hopefully diskstore can fix all these issues. For instance, with
diskstore BGSAVE already works very well, and uses a thread. But
things should seriously improve once we provide the B-tree storage
option. Today I just finished the on-disk allocator that is the base
for my tree implementation, so things should start moving faster now
on all of this.
Cheers,
Salvatore
used_memory:6442442048
used_memory_human:6.00G
this is consistent with our configuration:
vm-max-memory 6gb
tim
instead, we have a cronjob running that dumps recently modified
hashes to disk, storing them as yaml files.
if redis should crash, we'd import all the yaml files into a
new redis instance (takes around 30 minutes) and we'd lose about
10 minutes of unsaved changes.
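the cronjob itself is nothing fancy -- roughly this, sketched in
python (ours is ruby, and names / paths are invented):

import os, yaml, redis     # PyYAML + redis-py

r = redis.Redis()
DUMP_DIR = "/data/yaml"                      # hypothetical dump directory

def dump_modified(keys):
    # called every 10 minutes with the hashes modified since the last run
    for key in keys:
        data = {f.decode(): v.decode() for f, v in r.hgetall(key).items()}
        with open(os.path.join(DUMP_DIR, key + ".yml"), "w") as fh:
            yaml.safe_dump(data, fh)

def restore_all():
    # after a crash: re-import every yaml file into a fresh instance
    for name in os.listdir(DUMP_DIR):
        with open(os.path.join(DUMP_DIR, name)) as fh:
            for field, value in yaml.safe_load(fh).items():
                r.hset(name[:-len(".yml")], field, value)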
we hope to replace this homegrown diskstore precursor with the
real thing, eventually :)
tim
yes, fast local disks in a raid1 setup.
> Why aren't you using AOF instead of yaml files?
AOF wouldn't work in our case (see first message of this thread).
however, you could consider the yaml files as an "in-place" AOF
that never needs to be compacted.
tim