we have been running redis with vm enabled in production for two
months now, and we would like to share our experience so far.
quick summary: redis with vm sucks.
first, some context. we use redis as the main datastore for a
quickly growing dataset. the total size is about 8 GB (on disk)
now, and projected to grow to 500 GB or more eventually. it
consists of redis hashes, which vary in size between 5 and 300
KB. at any given time, only a small percentage of hashes are in
use.
in theory, this looks like an ideal use case for redis vm, as the
"hot" dataset is small (only a few GB) and the individual values
are rather large.
in practice, we ran into the following problems:
a) stability
as reported in issue #395, redis 2.0.4 is very unstable with
both vm and replication enabled, crashing twice on us after less
than an hour once we put load on it. we went back to 2.0.1, and
we considered it rock solid until yesterday, when it unexpectedly
crashed as well.
http://code.google.com/p/redis/issues/detail?id=395
b) memory consumption
as reported in issue #394 (by brett), redis with vm "greatly
exceeds vm-max-memory ... [and] continues to grow." our boxes
have 24 GB of ram (and no swap). we have configured vm as
follows:
vm-enabled yes
vm-max-memory 6gb
vm-page-size 8kb
vm-pages 50304000
vm-max-threads 4
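for reference, here is what these settings work out to (a quick
python sketch, using only the numbers above):

page_size = 8 * 1024          # vm-page-size 8kb
pages = 50304000              # vm-pages

swap_bytes = page_size * pages
print("swap file size: %.1f GB" % (swap_bytes / 1024.0 ** 3))      # ~384 GB

for value_kb in (5, 300):     # our hashes are roughly 5-300 KB serialized
    pages_needed = -(-value_kb * 1024 // page_size)                # ceil division
    print("%3d KB value -> %2d pages" % (value_kb, pages_needed))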
after starting up an empty instance and populating it with 8 GB
of data, the redis process uses 16.9 GB of memory and continues
to grow over time -- first quickly, then more and more slowly.
the growth never completely stops though, so there seems to be
some kind of memory leak.
http://code.google.com/p/redis/issues/detail?id=394
c) persistence / durability
because of a), it would be very desirable to make frequent
backups. but the available redis persistence options -- snapshots
and aof -- do not play well with vm.
to create a snapshot (with BGSAVE / SAVE), redis first has to
swap all data back into memory before writing it out to disk
again. with a large dataset, this takes extremely long and soon
becomes infeasible -- especially in combination with b), as the
snapshotting child process consumes even more memory and will
frequently be killed by the os before the snapshot can be
completed.
we assume aof (BGREWRITEAOF) behaves essentially the same,
although we have not tried that.
d) replication
unfortunately, replication soon becomes infeasible as well, as
connecting a new slave forces the master to create a snapshot
internally (see c).
so, to sum this up -- after a while, you are stuck with an
in-memory database that you cannot backup, cannot replicate to
a standby machine, and that will eventually consume all memory
and crash (if it does not crash earlier).
conclusion: redis with vm enabled is pretty much unusable, and we
would really not recommend it to anybody else for production use
at the moment. (at least not as a database, it might work better
as a cache.)
as a workaround, we are going to turn off vm, and let the
application dump / restore individual hashes on-demand to / from
disk -- effectively we are going to write our own virtual memory
implementation in ruby.
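just to make the plan concrete, something along these lines (a
python sketch rather than the ruby we will actually write, with
all names invented):

import os, json, redis     # redis-py; json stands in for whatever serialization we pick

r = redis.Redis()
DUMP_DIR = "/data/hashes"  # hypothetical dump directory

def restore(key):
    # load a hash from disk into redis if it is not already there
    if not r.exists(key):
        path = os.path.join(DUMP_DIR, key)
        if os.path.exists(path):
            with open(path) as f:
                for field, value in json.load(f).items():
                    r.hset(key, field, value)

def dump(key):
    # write a hash out to disk and drop it from redis to free memory
    data = {f.decode(): v.decode() for f, v in r.hgetall(key).items()}
    with open(os.path.join(DUMP_DIR, key), "w") as f:
        json.dump(data, f)
    r.delete(key)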
although the main focus of redis development is on redis cluster
now, this will only partly solve the problem of large datasets.
to make vm a feasible option, at least a better persistence
strategy is urgently needed -- maybe by turning the swap file
into a durable representation of the whole dataset.
cheers
tim
> in practice, we ran into the following problems:
>
>
> a) stability
>
> as reported in issue #395, redis 2.0.4 is very unstable with
> both vm and replication enabled, crashing twice on us after less
> than an hour once we put load on it. we went back to 2.0.1, and
> we considered it rock solid until yesterday, when it unexpectedly
> crashed as well.
>
> http://code.google.com/p/redis/issues/detail?id=395
Hello Tim, in order to get rid of this bug it is fundamental to have
two pieces of information, but since this bug is hard to reproduce,
it's hard for us to make progress without external help.
1) Without replication enabled, is the redis instance stable?
2) What happens when you switch to 2.2-RC1? Is it stable? Even with replication?
Any help on this is really appreciated.
> b) memory consumption
>
> as reported in issue #394 (by brett), redis with vm "greatly
> exceeds vm-max-memory ... [and] continues to grow." our boxes
> have 24 GB of ram (and no swap). we have configured vm as
> follows:
>
> vm-enabled yes
> vm-max-memory 6gb
> vm-page-size 8kb
> vm-pages 50304000
> vm-max-threads 4
Here you probably ran out of pages, or have too many keys, or
something like that; the only way to tell is if you post the INFO
output.
Redis is not able to swap keys in any way, nor to put more than a
single value in a single page.
The page size should probably be more like 64 bytes than 8 kb in your use case.
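Just to give an idea of the tradeoff, a quick sketch (python, with
made-up numbers roughly in the ballpark of your dataset):

# internal fragmentation per swapped value for different page sizes:
# a value of N bytes occupies ceil(N / page_size) pages, so on average it
# wastes up to one page. the flip side of tiny pages is that vm-pages must
# be much larger (redis keeps an in-memory bitmap with one bit per page).
values = 900000         # assumed number of swapped objects
avg_value = 10000       # assumed average serialized value size, in bytes

for page in (64, 4096, 8192):
    pages_per_value = -(-avg_value // page)
    wasted = values * (pages_per_value * page - avg_value)
    print("page %5d B -> %3d pages/value, ~%5d MB wasted overall"
          % (page, pages_per_value, wasted // 2 ** 20))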
> after starting up an empty instance and populating it with 8 GB
> of data, the redis process uses 16.9 GB of memory and continues
> to grow over time -- first quickly, then more and more slowly.
> the growth never completely stops though, so there seems to be
> some kind of memory leak.
I don't think there is any memory leak, it sounds like a different problem.
With the INFO output I can provide more information.
> c) persistence / durability
>
> because of a), it would be very desirable to make frequent
> backups. but the available redis persistence options -- snapshots
> and aof -- do not play well with vm.
>
> to create a snapshot (with BGSAVE / SAVE), redis first has to
> swap all data back into memory before writing it out to disk
> again. with a large dataset, this takes extremely long and soon
> becomes infeasible -- especially in combination with b), as the
> snapshotting child process consumes even more memory and will
> frequently be killed by the os before the snapshot can be
> completed.
This problem is partially addressed in 2.2 (the child will not use too
much additional memory), but to fix this in a proper way what's needed
is to change the VM implementation so that the data is prefixed by its
length.
That way, in order to save a swapped-out value all there is to do is a
read and a write, without serialization / deserialization of what is
stored inside.
This is a change I want to make. There are also a few changes that can
make swapping out values almost twice as fast, which should be added as
well.
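The shape of the idea, as a sketch (python, hypothetical on-disk
format, not the real code):

# if every swapped value is stored as <4-byte length><already-serialized payload>,
# saving it during BGSAVE becomes one read plus one write, with no need to load
# the object back into memory and re-serialize it.
import struct

def copy_swapped_value(swap_file, offset, rdb_file):
    swap_file.seek(offset)
    (length,) = struct.unpack(">I", swap_file.read(4))   # length prefix
    rdb_file.write(swap_file.read(length))               # raw copy, no decode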
> we assume aof (BGREWRITEAOF) behaves essentially the same,
> although we have not tried that.
Yes it is exactly the same as BGSAVE.
> d) replication
>
> unfortunately, replication soon becomes infeasible as well, as
> connecting a new slave forces the master to create a snapshot
> internally (see c).
>
>
> so, to sum this up -- after a while, you are stuck with an
> in-memory database that you cannot backup, cannot replicate to
> a standby machine, and that will eventually consume all memory
> and crash (if it does not crash earlier).
>
> conclusion: redis with vm enabled is pretty much unusable, and we
> would really not recommend it to anybody else for production use
> at the moment. (at least not as a database, it might work better
> as a cache.)
>
>
> as a workaround, we are going to turn off vm, and let the
> application dump / restore individual hashes on-demand to / from
> disk -- effectively we are going to write our own virtual memory
> implementation in ruby.
The 2.2 VM is almost a rewrite; I think it's worth checking how 2.2
behaves ASAP, as the 2.2 VM is probably more stable / functional than
2.0.
Also you can likely turn the 2.0 vm into a more stable (and sometimes
much faster) one with this config line:
vm-max-threads 0
If you can confirm this it would be much appreciated.
This will also be an interesting hint about where the bug is.
> although the main focus of redis development is on redis cluster
> now, this will only partly solve the problem of large datasets.
> to make vm a feasible option, at least a better persistence
> strategy is urgently needed -- maybe by turning the swap file
> into a durable representation of the whole dataset.
This is a problem (turning the swap file into a durable
representation) as it requires a major rewrite and turns the design
the other way around (what is often used stays in memory, but
everything is anyway flushed to the swap file from time to time).
Currently what I want to do to address these problems is:
1) Check if 2.2 behaves much better, as we think it does.
2) Prefix pages in the VM with the real object length, so we can make
persistence much faster.
3) Possibly mmap() everything for performance.
4) Check how Redis cluster performs, and what the user reaction will be.
5) Even think about removing VM and killing the feature entirely... if
needed, for the 3.0 release.
I'll work at "2" and other VM enhancements starting from January, but
we also need help from users.
One of the problem of VM is that a very smallish amount of people is
using it, and this brings us to solution "5"...
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
http://invece.org
"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotele
> we are deploying a very big implementation of redis with vm to production next week.
uh, good luck :)
tim
Thank you, make sure to use a good tradeoff between the number of pages
and the page size. It's a good idea to avoid pages that are too big; try
64 bytes, for instance.
Please save any logs in case of stability problems, and also check
whether the problems go away when you disconnect the slave, or
alternatively with vm-max-threads 0.
I'll be very happy to provide support directly during the deployment
or if you run into problems.
Thanks!
Salvatore
>> a) stability
> 1) Without replication enabled, is the redis instance stable?
well, i guess we will find out over christmas -- we are running
without a standby slave now.
> 2) What happens when you switch to 2.2-RC1? Is it stable? Even with replication?
if it should crash again, i might try 2.2-RC1 instead, but without
replication.
>> b) memory consumption
> Here you probably ran out of pages, or have too many keys, or
> something like that; the only way to tell is if you post the INFO
> output.
vm_conf_max_memory:6442450944
vm_conf_page_size:8192
vm_conf_pages:50304000
vm_stats_used_pages:1110340
vm_stats_swapped_objects:854020
vm_stats_swappin_count:192009
vm_stats_swappout_count:1046029
vm_stats_io_newjobs_len:0
vm_stats_io_processing_len:0
vm_stats_io_processed_len:0
vm_stats_io_active_threads:0
vm_stats_blocked_clients:0
db2:keys=917917,expires=0
db3:keys=13,expires=0
> Also you can likely turn the 2.0 vm into a more stable (and sometimes
> much faster) one with this config line:
>
> vm-max-threads 0
>
> If you can confirm this it would be much appreciated.
hmmm, i have been toying with the idea, but i am afraid of blocking
other clients -- although we have really fast disks in raid 10.
>> to make vm a feasible option, at least a better persistence
>> strategy is urgently needed -- maybe by turning the swap file
>> into a durable representation of the whole dataset.
>
> This is a problem (turning the swap file into a durable
> representation) as it requires a major rewrite and turns the design
> the other way around (what is often used stays in memory, but
> everything is anyway flushed to the swap file from time to time).
it is the only viable strategy that i can think of, at least
for really large datasets.
> Currently what I want to do to address these problems is:
>
> 1) Check if 2.2 behaves much better, as we think it does.
> 2) Prefix pages in the VM with the real object length, so we can make
> persistence much faster.
> 3) Possibly mmap() everything for performance.
> 4) Check how Redis cluster performs, and what the user reaction will be.
> 5) Even think about removing VM and killing the feature entirely... if
> needed, for the 3.0 release.
>
> I'll work on "2" and other VM enhancements starting in January, but
> we also need help from users.
> One of the problems with VM is that a very small number of people are
> using it, and this brings us to solution "5"...
yeah, i understand, and i am afraid my rant might drive the number
even further down ;)
On Thu, Dec 23, 2010 at 4:42 PM, Jonathan Leibiusky <iona...@gmail.com> wrote:
> we are deploying a very big implementation of redis with vm to production
> next week.
> 4 servers with 16 GB RAM, 12 allocated to redis. we are sharding 1 TB of
> data between the 4 servers, so every redis server will handle 250 GB of
> data. to be able to do that we configured VM and AOF. oh, and the redis
> version is the latest 2.2 one.
> will let you know how redis behaves in the next few weeks.
> and if you have any suggestions just let me know, i will be happy to play
> with the conf and give some feedback.
Thank you, make sure to use a good tradeoff between the number of pages
and the page size. It's a good idea to avoid pages that are too big; try
64 bytes, for instance.
Please save any logs in case of stability problems, and also check
whether the problems go away when you disconnect the slave, or
alternatively with vm-max-threads 0.
I'll be very happy to provide support directly during the deployment
or if you run into problems.
Thanks, this is useful.
>> 2) What happens when you switch to 2.2-RC1? Is it stable? Even with replication?
>
> if it should crash again, i might try 2.2-RC1 instead, but without
> replication.
As a bonus, the VM of 2.2 is a major rewrite with the good side effect
that objects are not "fat objects" but just normal Redis objects, like
when running without VM enabled.
So it should use a lot less memory.
>>> b) memory consumption
>
>> Here you probably ran out of pages, or have too many keys, or
>> something like that; the only way to tell is if you post the INFO
>> output.
>
> vm_conf_max_memory:6442450944
> vm_conf_page_size:8192
> vm_conf_pages:50304000
> vm_stats_used_pages:1110340
> vm_stats_swapped_objects:854020
> vm_stats_swappin_count:192009
> vm_stats_swappout_count:1046029
> vm_stats_io_newjobs_len:0
> vm_stats_io_processing_len:0
> vm_stats_io_processed_len:0
> vm_stats_io_active_threads:0
> vm_stats_blocked_clients:0
> db2:keys=917917,expires=0
> db3:keys=13,expires=0
It seems like everything is already swapped, since total keys ->
917917, swapped -> 854020. Possibly the non-swapped part consists of
numbers that are shared objects, so they will never get swapped, or
something like that. Hard to tell without knowing the exact nature of
the data set.
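In numbers (a quick sketch using the values from your INFO):

used_pages = 1110340
page_size = 8192
swapped_objects = 854020
keys_db2 = 917917

print("swap space used : %.1f GB" % (used_pages * page_size / 1024.0 ** 3))   # ~8.5 GB
print("swapped fraction: %.0f%%" % (100.0 * swapped_objects / keys_db2))      # ~93%
print("avg pages/object: %.2f" % (float(used_pages) / swapped_objects))       # ~1.3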
>> Also you can likely turn the 2.0 vm into a more stable (and sometimes
>> much faster) one with this config line:
>>
>> vm-max-threads 0
>>
>> If you can confirm this it would be much appreciated.
>
> hmmm, i have been toying with the idea, but i am afraid of blocking
> other clients -- although we have really fast disks in raid 10.
Chances are that the VM of 2.0 will be much faster with this
setting... it is blocking but it is much more efficient since there is
no message passing between different threads, clients are not
suspended, and so forth.
>> This is a problem (turning the swap file into a durable
>> representation) as it requires a major rewrite and turns the design
>> the other way around (what is often used stays in memory, but
>> everything is anyway flushed to the swap file from time to time).
>
> it is the only viable strategy that i can think of, at least
> for really large datasets.
Probably yes, but it's important to realize that the price to pay is
no "point in time" dump. You have a representation of all the
keys/values on disk, and a copy of the most used things in memory (big
win -> vm-max-memory can always be honored). This is how it could
work:
1 - Every time we read from a key that is already in memory, nothing
special, we read...
2 - Every time we write to a key that is already in memory, we write,
and we put this key into a "need-to-flush-on-disk" queue or something
like that. Possibly we wait some seconds before flushing (configurable),
so N writes will result in a single write to disk.
3 - Every time we read/write a key that is not in memory, we load it
into memory (possibly removing some other key from memory using LRU),
and then we do 1) or 2) accordingly.
So persistence is per-key in this design. Not a big deal. For sure a
design much better than the current one. Also it is interesting that
this design will bring the DB online one second after typing
./redis-server.
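Just to make it concrete, a sketch of the behaviour (python,
everything simplified and every name invented -- the real thing would
of course live inside the server, not in a client):

import time
from collections import OrderedDict

class DiskBackedCache:
    """keys live on disk; the most used ones are cached in memory,
    writes go into a flush queue and hit the disk after a delay."""

    def __init__(self, store, max_keys=10000, flush_delay=5.0):
        self.store = store            # object with load(key) / save(key, value)
        self.cache = OrderedDict()    # key -> value, kept in LRU order
        self.dirty = {}               # key -> time of first unflushed write
        self.max_keys = max_keys
        self.flush_delay = flush_delay

    def get(self, key):
        if key not in self.cache:                 # rule 3: load on miss
            self._make_room()
            self.cache[key] = self.store.load(key)   # load() returns None if absent
        self.cache.move_to_end(key)               # rule 1: plain read, refresh LRU
        return self.cache[key]

    def set(self, key, value):
        self.get(key)                             # make sure it is resident
        self.cache[key] = value
        self.dirty.setdefault(key, time.time())   # rule 2: queue for flushing

    def flush(self):
        now = time.time()
        for key, since in list(self.dirty.items()):
            if now - since >= self.flush_delay:   # N writes -> one disk write
                self.store.save(key, self.cache.get(key))
                del self.dirty[key]

    def _make_room(self):
        while len(self.cache) >= self.max_keys:
            old, value = self.cache.popitem(last=False)   # evict the LRU key
            if old in self.dirty:                         # don't lose dirty data
                self.store.save(old, value)
                del self.dirty[old]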
>> Currently what I want to do to address these problems is:
> yeah, i understand, and i am afraid my rant might drive the number
> even further down ;)
Unfortunately your rant is more than justified: while VM works in
some scenarios, it is hardly a good implementation currently.
We need to understand whether it's the way to go, whether it's worth
improving (for instance with the design above), or whether it's better
to kill it, and so forth.
Auto-replying just to add: it is important to understand one side
effect of this design, that is, every access to a non-existing key
will cost us a disk access.
This is not too bad, as most applications rarely access non-existing
keys except to create them.
There is also the trick of negative caching, that is, reserving
some memory or using a good heuristic to keep in memory the
information that a given key is not on disk (a bloom filter can help
here).
But there are very good things about this solution, and a good part
can be implemented on top of our current VM layer.
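For instance (a sketch, python): here the filter remembers the keys
that DO exist on disk, so a negative answer is always safe and only
the occasional false positive costs an extra disk access.

import hashlib

class ExistsFilter:
    """bloom filter over the keys present on disk."""

    def __init__(self, bits=8 * 1024 * 1024, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.bitmap = bytearray(bits // 8)

    def _positions(self, key):
        digest = hashlib.sha1(key.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.bits

    def add(self, key):                     # call whenever a key is written to disk
        for pos in self._positions(key):
            self.bitmap[pos // 8] |= 1 << (pos % 8)

    def may_exist(self, key):               # False -> key is certainly not on disk
        return all(self.bitmap[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))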
>> vm_stats_swapped_objects:854020
>> vm_stats_swappin_count:192009
>> vm_stats_swappout_count:1046029
>> db2:keys=917917,expires=0
>> db3:keys=13,expires=0
>
> It seems like everything is already swapped, since total keys ->
> 917917, swapped -> 854020. Possibly the non-swapped part consists of
> numbers that are shared objects, so they will never get swapped, or
> something like that. Hard to tell without knowing the exact nature of
> the data set.
the values in db3 are shared data structures that are accessed
all the time. db2 contains only hashes, the bulk of the dataset.
each hash represents a user and is only needed when the user
is online, so these can be swapped out most of the time.
>>> This is a problem (turning the swap file into a durable
>>> representation) as it requires a major rewrite and turns the design
>>> the other way around (what is often used stays in memory, but
>>> everything is anyway flushed to the swap file from time to time).
>>
>> it is the only viable strategy that i can think of, at least
>> for really large datasets.
>
> Probably yes, but it's important to realize that the price to pay is
> no "point in time" dump.
that is a price i would be very willing to pay -- it would mean
that we could never lose more than x minutes of updates per user,
which is good enough (at least for our use case, social games).
> You have a representation of all the
> keys/values on disk, and a copy of the most used things in memory (big
> win -> vm-max-memory can always be honored).
yeah, having control over memory consumed would be awesome. :)
> This is how it could work:
>
> 1 - Every time we read from a key that is already in memory, nothing
> special, we read...
> 2 - Every time we write to a key that is already in memory, we write,
> and we put this key into a "need-to-flush-on-disk" queue or something
> like that. Possibly we wait some seconds before flushing (configurable),
> so N writes will result in a single write to disk.
this part we have already implemented in our application, and it
works really well. we flush hashes to disk 10 minutes after they
are first modified, and every 10 minutes again as long as there
are new modifications.
we chose 10 minutes because user sessions only last 5 minutes on
average, so we save a lot of disk io this way.
the flushing strategy (every x minutes, after x modifications ...)
would be a very important knob in the new implementation, i think.
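in pseudo-code it is roughly this (python sketch -- the real code is
ruby, and the writer function is whatever actually persists the hash):

import time

FLUSH_AFTER = 10 * 60           # flush a hash 10 minutes after its first change
first_dirty = {}                # key -> timestamp of the first unflushed change

def mark_modified(key, now=None):
    # remember only the *first* modification; later ones ride along for free
    first_dirty.setdefault(key, now or time.time())

def flush_due(write_to_disk, now=None):
    # write_to_disk(key) is whatever persists the hash (yaml file, etc.)
    now = now or time.time()
    for key, since in list(first_dirty.items()):
        if now - since >= FLUSH_AFTER:
            write_to_disk(key)
            del first_dirty[key]    # the next write starts a new 10 minute window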
> 3 - Every time we read/write a key that is not in memory, we load it
> into memory (possibly removing some other key from memory using LRU),
> and then we do 1) or 2) accordingly.
>
> So persistence is per-key in this design.
exactly. the last part we still have to get working -- we plan to use
key expiration -- and then we can switch off vm.
> For sure a
> design much better than the current one. Also it is interesting that
> this design will bring the DB online one second after typing
> ./redis-server.
yeah, i am very much looking forward to that as well. :)
This is actually no longer VM, it is a different persistence engine,
as it makes sense to use it even when RAM is not a problem, in order
to obtain fast restarts, no forking of a child, and so forth, in
exchange for performance if the application is very write intensive,
or alternatively in exchange for durability (if there are many writes,
even if the whole database fits in RAM, we need to flush these writes
to disk ASAP, consuming CPU and possibly not being able to flush data
to disk as fast as configured).
There is a big open problem: replication. The current replication
design assumes we are able to transfer a point-in-time dump, but with
such a persistence engine this is no longer possible. So we need to
invent something new... but I doubt there are good solutions to this
problem.
Cheers,
Salvatore
I just found a solution to this (it's not different from the current
VM itself: just suspend writes to have a stable dump on disk, fork,
and do a scan to produce the .rdb, or even just copy the file and make
the slave able to understand this format. Either way it's mostly as
fast as copying a file, as we'll make sure our btree dump values use
the same serialization as the .rdb).
And there are tons of other optimizations we can implement. For
instance when storing ziplists or zipmaps and so forth, just store
them as blobs as they are already blobs. If you are into Redis
internals you should know what an improvement this alone can be.
Btw I'll produce a detailed writeup on my blog about all these future
directions. We have known for a long time that the VM implementation
is not optimal at all, but the direction was to make cluster cool and
eventually drop this support. Maybe we have better options, and at
least we should try what seems to be a much saner design.
I'll post a message here when the blog entry is up on antirez.com
Cheers,
another suggestion: we store each value in a separate file, using
a two-level directory tree like bigdis. this works really well and
makes the on-disk representation less opaque than one big file.
one concrete advantage is that we can do incremental backups using
rsync -- something like poor man's snapshots (as unfortunately we
don't have ZFS).
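the key-to-path mapping is trivial, something like this (python
sketch; the exact hashing scheme is just an example, not necessarily
what bigdis does):

import hashlib, os

ROOT = "/data/users"                      # hypothetical root directory

def path_for(key):
    # two levels of two-digit decimal directories, 00-99 each,
    # so 10000 leaf directories in total
    n = int(hashlib.md5(key.encode()).hexdigest(), 16)
    d1, d2 = "%02d" % (n % 100), "%02d" % (n // 100 % 100)
    return os.path.join(ROOT, d1, d2, key)

def store(key, blob):
    path = path_for(key)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(blob)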
> Btw I'll produce a detailed writeup on my blog about all these future
> directions.
good, very much looking forward to that!
> another suggestion: we store each value in a separate file, using
> a two-level directory tree like bigdis. this works really well and
> makes the on-disk representation less opaque than one big file.
> one concrete advantage is that we can do incremental backups using
> rsync -- something like poor man's snapshots (as unfortunately we
> don't have ZFS).
plus, we don't have to pre-allocate a fixed size swap file in advance,
and thus we avoid the headache of deciding on the "right" number and
size of swap pages, and the size of the dataset is not artificially
limited in any way.
tim
What you are doing in this way is using the filesystem as a btree
implementation. If it is a good implementation for holding tons of
small files, it's going to work well; otherwise, it's not going to
work very well :)
I used this in bigdis, as the bigdis prototype was designed for big
values. In this context the overhead of the filesystem is not going to
be so big.
But for millions or billions of small keys -> values, is this going to
work well? On what file systems?
How much performance / efficiency do we lose compared to a real btree
in a single file?
We need good answers to these questions in order to pick the best option.
Cheers,
Salvatore
> I ran redis-stat (https://github.com/antirez/redis-tools) on a smaller
> version of the same data and it told me that the best size for the page
> is 4096 bytes. Do you still think it is better to leave it at 64 bytes?
>
> Oh, and using AOF and having 16 GB RAM, giving redis 12 GB, do you think
> it is OK, or in order to do BGREWRITEAOF once a day (during low traffic
> hours) is it better to have more free RAM for the child process that
> regenerates the aof file?
redis-stat is really a prototype, better to use 64 bytes as the page size ;)
AOF is perfectly fine, but how often to BGREWRITEAOF depends on the
amount of writes you have.
What I suggest is to check how much it grows every minute on average,
so that you make sure to rewrite it when it gets too big compared to
how big it is just after a BGREWRITEAOF completes successfully.
For instance, if you have mostly reads, you can rewrite just once every
24 hours.
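A trivial cron-style check could look like this (a sketch, python
with redis-py; the path and the threshold are just examples):

import os, redis                     # assumes redis-py and local access to the AOF

AOF_PATH = "/var/lib/redis/appendonly.aof"   # hypothetical path
GROWTH_FACTOR = 2                            # rewrite when the file has doubled

r = redis.Redis()
baseline = os.path.getsize(AOF_PATH)         # size right after the last rewrite

def maybe_rewrite():
    global baseline
    size = os.path.getsize(AOF_PATH)
    if size > baseline * GROWTH_FACTOR:
        r.bgrewriteaof()                     # rewrite happens in the background
        baseline = size                      # rough; re-measure once it completes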
Jeremy, I share your vision in many ways.
In a sentence I could say that while Redis is "one of the many" from
the point of view of disk persistence, it is pretty unique as an
in-memory DB and in the operations that are efficient because it is in
memory (like sorted sets).
I want to improve persistence and VM (which is just going to become a
different persistence engine, no longer a real virtual memory, if we
implement the planned changes), as they are important for the single
node, whether that node is part of a cluster or just a single
instance, but cluster will be the major focus of 2011 for sure.
Possibly working only on Redis cluster is not a good idea, so we'll
try to carry on this other work at the same time. I hope that by the
end of January I can publish the cluster code in its alpha stage, but
with the main ideas already in place, so that Pieter and I can start
working together on it. I'm not doing this now just because I think
it's very hard to collaborate on a design that is too much of a work
in progress, and that the initial design of everything is best done by
a single person (not just in computer science, in almost everything).
this approach certainly *does* scale to millions of keys -- we
currently use two-digit decimal directories, i.e. 00 to 99, on
two levels, so 10000 "leaf" directories in total. given a million
evenly distributed keys, this means around 100 values per dir.
two levels of two-digit hex directories would result in 65536 leaf
directories, or around 65 million keys with roughly 1000 keys per dir.
making it scale to billions of keys might be tricky though ;)
tim
> making it scale to billions of keys might be tricky though ;)
Yes, there is also a performance issue... and btree.c/h inside sqlite
are very tempting ;)
Cheers,
Salvatore
ubuntu@peking:/data/users$ time find . | wc -l
943431
real 0m0.634s
user 0m0.260s
sys 0m0.400s
but yeah, i see your point :)
cheers
tim
i still think there is a memory leak (issue #394).
here is some data (output of 'ps aux') that i collected over
the last week. it shows the growing memory consumption of our
redis process (no slaves connected, no persistence, vm enabled,
vm-max-memory 6gb):
USER PID %CPU %MEM VSZ RSS
Mon Dec 27 07:27:37 CET 2010
redis 24044 20.7 76.2 19091212 18871980
Wed Dec 29 20:55:34 CET 2010
redis 24044 23.8 78.3 19617544 19379564
Thu Dec 30 14:15:00 CET 2010
redis 24044 23.9 78.3 19619592 19381604
Fri Dec 31 12:41:20 CET 2010
redis 24044 24.4 78.9 19754760 19539692
Sat Jan 1 20:55:54 CET 2011
redis 24044 24.4 79.5 19911428 19674032
Sun Jan 2 23:08:13 CET 2011
redis 24044 25.1 79.7 19969796 19735380
redis is doing between 2000 and 8000 operations per second,
and we seem to be losing almost a gig of memory per week.
(info has constantly been reporting 'used_memory_human:6.00G'
the whole time.)
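(for the curious: a small python sketch of how we could collect this
kind of data, comparing the process rss against redis' own
used_memory -- the ratio between the two is basically the
fragmentation. pid and interval are just examples.)

import subprocess, time, redis     # redis-py

PID = 24044                        # redis process id (example)
r = redis.Redis()

while True:
    rss_kb = int(subprocess.check_output(
        ["ps", "-o", "rss=", "-p", str(PID)]).decode().strip())
    used = int(r.info()["used_memory"])
    print("%s rss=%.2f GB used_memory=%.2f GB ratio=%.2f" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        rss_kb * 1024 / 1024.0 ** 3, used / 1024.0 ** 3,
        rss_kb * 1024.0 / used))
    time.sleep(3600)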
cheers
tim
On 2010-12-23, at 16:24 , Salvatore Sanfilippo wrote:
> On Thu, Dec 23, 2010 at 4:07 PM, Tim Lossen <t...@lossen.de> wrote:
>> after starting up an empty instance and populating it with 8 GB
>> of data, the redis process uses 16.9 GB of memory and continues
>> to grow over time -- first quickly, then more and more slowly.
>> the growth never completely stops though, so there seems to be
>> some kind of memory leak.
>
> I don't think there is any memory leak, it sounds like a different problem.
Tim, Pieter,
indeed it is impossible that it's a memory leak since the memory
reported by Redis is always 6GB.
Redis traps every allocation using a wrapper (zmalloc.c), so a memory
leak would have the effect of the memory reported by Redis increasing
over time.
I think this is definitely fragmentation, and is probably related to
the use of hashes. My guess is that in datasets with a lot of
specially encoded hashes that all, in general, grow monotonically in
size, old allocations are almost always discarded for bigger ones.
When hashes are not integer encoded this is not a problem, since the
single objects are always the same size, but here it is different. The
fragmentation will stop at some point; it should not be a never-ending
process.
The solution to this problem is a slab allocator, but it may have the
effect of using much more ram at first, as memory is allocated in size
classes that are usually powers of two or similar.
Another problem you are experiencing is that 2.0 used a larger Redis
object when VM was enabled.
I strongly suggest upgrading to 2.2.
Later today I'll send you a small ruby script you can run against your
instance to sample some information about your dataset (it's something
safe to run in production), but please, if you can, upgrade to 2.2 asap.
Before you upgrade to 2.2, later today I'll add a CONFIG SET option so
that you can change vm-max-memory dynamically at runtime. That way you
can start with less memory and use more and more as you see the
fragmentation stabilize. But anyway, with 2.2 you are going to
make a much better use of your memory, because:
1) LRU for eviction in VM.
2) Smaller objects.
That said, now that there is no replica, is my understanding correct
that Redis 2.0.x is no longer crashing with VM enabled? This would
confirm our guesses that the problem is replication + VM, so we can
create a testing environment to reproduce the problem and fix it, and
check if it also happens in 2.2 (I'm optimistic about it since it's
almost a complete rewrite).
About the future of "big data Redis" I'm working hard at diskstore,
the idea we discussed in this thread.
It's already working and I'll send a detailed email about it today in
the mailing list, I really need help in evaluating how well it works,
but the first results are really encouraging.
thanks for looking into this matter.
> indeed it is impossible that it's a memory leak since the memory
> reported by Redis is always 6GB.
> I think this is definitely fragmentation, and is probably related to
> the use of hashes.
yes, very likely. in fact, we moved all non-hash values to a
different redis instance yesterday, and the RSS is still increasing.
> Later today I'll send you a small ruby script you can run against your
> instance to sample some information about your dataset (it's something
> safe to run in production),
ok, i am happy to help to nail down this problem, if possible.
> but please, if you can, upgrade to 2.2 asap.
hmmmm .... we are not very keen on that. this is a production system,
and every restart of redis involves around 30 minutes of downtime, as
we have to load all data into the new instance. if redis 2.2 should not
perform as expected, we would incur yet another downtime to switch back.
> with 2.2 you are going to
> make a much better use of your memory, because:
>
> 1) LRU for eviction in VM.
> 2) Smaller objects.
interesting -- what is the eviction strategy in 2.0, if not LRU?
> That said, now that there is no replica, is my understanding correct
> that Redis 2.0.x is no longer crashing with VM enabled?
yes, we are running without replication now. hard to say if this has
improved stability, though: the last time (= with replication) redis
crashed after 20 days, current uptime is 12 days.
> About the future of "big data Redis" I'm working hard at diskstore,
> the idea we discussed in this thread.
> It's already working and I'll send a detailed email about it today in
> the mailing list
cool, very much looking forward to that!
we have been working on our own application-specific "restore-on-demand"
implementation as well, and once this is stable, we plan to finally turn
off vm in a few days.
Hello Marc,
you are right, it is important to consider what was wrong with VM.
But I don't think it was the granularity.
Most objects in Redis are small, especially when they are pieces of
bigger aggregate data types.
So even just keeping pointers to the data on disk is not only hard,
but also mostly useless, as the pointers alone take a comparable
amount of memory.
Another reason is that with the specially encoded data types we have
now (for hashes, lists, and sets under a given size) it makes a lot of
sense to write them as a single blob on disk, as they are already
stored this way in memory.
Under all these assumptions, there is *no* good solution for handling
large aggregate data types on disk, other than:
1) Change the application so that, if possible, smaller aggregated types are used.
2) Implement the data structures we support *on disk*.
or simply... if you have a big sorted set, use Redis in the native
in-memory way.
Btw approach '1' can be done by the guy writing the application, so
that diskstore becomes viable.
Approach '2' is a nightmare, as all the atomic stuff we do is fast
and trivial only because it is done in RAM.
So that said, I think that neither having an on-disk representation
of what we have in ram, in a way that is hackable directly on disk,
nor having a finer granularity, is going to help, and that was not the
problem with VM.
So what was the problem with VM? :)
1) It was not able to swap keys. This was a huge mistake. The typical
application of Redis diskstore will be to have millions of objects
with a small hash inside. If you can't swap keys, the keys alone are
already too much data in RAM for a small instance. I think that
swapping sub-values would indeed repeat this error.
2) We had most of the data on disk, but this stored representation was
a throw-away business, not useful for persistence. So we ended up
having the same data on disk in the swap file, in memory, and in the
.rdb for persistence. Far from ideal.
3) It did not fit the "COW style" persistence model of Redis. In
order to save the dataset you basically had to access the whole swap
file, value after value, and it was incredibly slow.
4) It was too similar to in-memory from the point of view of
tradeoffs. Same consistency, same persistence, same slow start (but
much slower, since there was swapping involved). The new system is
really an alternative, so the use cases are pretty distinct; it's not
very hard to tell when in-memory can work, when diskstore can work, or
neither (or both).
5) The implementation was too complex. This was definitely fixable, as
the implementation of diskstore was theoretically applicable to the
old VM code as well. Proof: diskstore worked fine after 5 days of
work. To reach the same level of stability, VM took 1 month of work.
To get diskstore better I think we need people applying it in the real
world ASAP.
And for this we need to ship a stable enough implementation ASAP. But
we are not too far ;)
USER PID %CPU %MEM VSZ RSS
Tue Jan 4 12:10:06 CET 2011
redis 24044 24.6 80.3 20061956 19880124
Wed Jan 5 18:06:25 CET 2011
redis 24044 24.3 80.3 20061956 19871592
Thu Jan 6 10:24:08 CET 2011
redis 24044 24.0 80.3 20061956 19871592
Fri Jan 7 15:27:28 CET 2011
redis 24044 23.6 80.3 20061956 19871584
the most likely culprit for memory fragmentation seems to be
sorted sets now. for example, we use one sorted set to keep
track of currently active users, and it is continuously being
updated (ZADD / ZREMRANGEBYSCORE). this is one of the values
we moved to a separate redis instance (without vm).
tim
On 2011-01-04, at 11:46 , Tim Lossen wrote:
>> indeed it is impossible that it's a memory leak since the memory
>> reported by Redis is always 6GB.
>
>> I think this is definitely fragmentation, and is probably related to
>> the use of hashes.
>
> yes, very likely. in fact, we moved all non-hash values to a
> different redis instance yesterday, and the RSS is still increasing.
> Let me play dumb for a second here: How can I, the guy writing the
> application, use smaller aggregated types? If I currently have a zset which
> has 1M elements, and I want to set an upper limit of some optimal size, say
> 1000, how do I segment this (and still be able to do zinter, zunion, etc.?)
In that case you simply don't want to use diskstore but memory.
Let's put it another way:
the use case for in-memory redis is everything supported by the data
model. But "I don't have that much memory, my dataset is too large!"
Ok, then you can use diskstore, but it will work well for a subset of
use cases that are, guess what, very similar to the use cases you can
model with an on-disk store. That is mainly plain key -> value
business, or anything where the value is an aggregate data type that
is not too large.
This means a lot of use cases: representing objects with small hashes,
short lists for capped timelines, small sets linking resources to a
list of tags, and so forth.
Your proposal of swapping single values does not solve this, for a
number of reasons; the only alternative would be to directly model our
types (lists, sets, hashes, ...) on disk, with on-disk data structures
that can directly support this kind of operation. But this is not
Redis business, we can't make all the users happy.
On the other hand, there are a few cases where a problem appears to
need large lists, sets, sorted sets, and so forth, but you can instead
model it using small data structures. In all these cases you can turn
a problem that has good solutions only with the in-memory backend into
one with good solutions with diskstore.
Example: long timeline? Segment it into N keys, where each one is a
list with at most 500 elements. Since timelines usually have an access
pattern where most of the time you only need to retrieve the latest
entries, this will work very well, and if somebody keeps hitting "read
more" you'll retrieve data from the next keys.
Another example: a big set used to check if something exists in a
given category? Instead of using a set, use plain keys with prefixes.
That way every key can move in and out of the diskstore cache
independently.
And so forth.
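The timeline example in code (a sketch, python with redis-py, key
names invented):

import redis   # redis-py

r = redis.Redis()
SEG = 500                                    # max elements per list segment

def timeline_add(user, entry):
    n = r.incr("tl:%s:count" % user)         # total number of entries so far
    seg = (n - 1) // SEG                     # which segment this entry lands in
    r.rpush("tl:%s:%d" % (user, seg), entry)

def timeline_latest(user, count=20):
    n = int(r.get("tl:%s:count" % user) or 0)
    out, seg = [], (n - 1) // SEG
    while seg >= 0 and len(out) < count:     # walk segments from newest to oldest
        chunk = r.lrange("tl:%s:%d" % (user, seg), 0, -1)
        out.extend(reversed(chunk))          # rpush stores oldest first
        seg -= 1
    return out[:count]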
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
Hello Sam, thank you for your email,
> We were running Redis 2.0 when all this happened, but as I understand
> it sorted sets didn't get much optimisation between 2.0 and 2.2?
Sorted sets use 20% less memory for sure in 2.2 -- any sorted set, I
mean. They also perform fewer allocations, so it is likely that they
fragment memory less.
Btw what I think is that this problem is evident with sorted sets
since they allocate memory in a special way, that is, sorted set nodes
are not all the same length: there are nodes with just one link and
nodes with up to 15 links. So I think that what happens is the
following:
when you free a sorted set due to VM (but this will happen similarly
if you have a workload that deletes sorted sets often), you end up
with free spaces of different sizes. Then Redis allocates more small
objects that from time to time will use space that was bigger, so the
next time there is a need to allocate a, let's say, 15-link node, the
old space will be hard to reclaim as many of these pieces are now
fragmented, and new memory will be used. And so forth, forever...
> Do you think the new diskstore branch is likely to get around this
> behaviour - perhaps because of the simpler representation of values on
> disk?
The problem described here is trivially solvable using a slab
allocator that can be implemented just touching zmalloc.c, so we can
implement this directly in 2.2 before shipping, as an optional
configuration flag (possibly set to yes by default, not sure). Why
optional? Because with work loads where there is no big risk of
fragmentation the usual allocator will use less memory. Slab
allocators allocate memory at segmented sizes (for instance power of
two) up to a given size, so it's a tradeoff between space used and
fragmentation.
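The size-class idea in a nutshell (a sketch, python; the classes and
the cutoff are arbitrary):

# round every small allocation up to a power-of-two size class, so freed chunks
# can be reused by later allocations of the same class instead of fragmenting.
def size_class(n, cutoff=4096):
    if n > cutoff:
        return n            # big allocations keep their exact size (own mapping)
    c = 64                  # smallest class
    while c < n:
        c *= 2
    return c

for n in (30, 90, 700, 5000):
    print(n, "->", size_class(n))   # 30 -> 64, 90 -> 128, 700 -> 1024, 5000 -> 5000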
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
open source developer - VMware
It would help to know whether they get bigger and bigger, how many
sorted sets there are, what the workload is, and so forth.
Thanks!
Salvatore
About that: is anybody listening here experiencing fragmentation
without VM?
Thanks,
Salvatore
On Jan 8, 2011 2:45 AM, "Salvatore Sanfilippo" <ant...@gmail.com> wrote:
> The problem described here is trivially solvable using a slab
> allocator that can be implemented just touching zmalloc.c, so we can
> implement this directly in 2.2 before shipping, as an optional
> configuration flag (possibly set to yes by default, not sure). Why
> optional? Because with work loads where there is no big risk of
> fragmentation the usual allocator will use less memory. Slab
> allocators allocate memory at segmented sizes (for instance power of
> two) up to a given size, so it's a tradeoff between space used and
> fragmentation.
Could you not simply round sorted-set allocations up to get a smaller set of size classes? That would probably reduce zset fragmentation without affecting all other allocations.
Mike
Hello Mike, this is not possible since the space difference from the
smallest to the largest zset node is too big.
Also, the same can happen with other allocation patterns not involving
zsets (for instance setting strings of increasing size via APPEND may
lead to similar problems).
Cheers,
Salvatore
For most allocators, once things are beyond a certain size they get a
dedicated mapping, meaning that they don't contribute to
fragmentation. Everything below that can be rounded up into a bucket
(power of two is popular, but there are probably other choices)
meaning that you get the reuse you want for avoiding fragmentation.
Another option is copying compaction, since the set of references to a
given object is pretty small and easy to find, but that would require
a more invasive change.
On the topic of allocators, there's a fantastic post by Jason Evans
about the work he did for fragmentation and performance improvements
on jemalloc, for Facebook's use:
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
jemalloc might be a good allocator choice in the future; we're using
it for Firefox, and we're very sensitive to both allocation speed and
fragmentation.
Mike
redis_version:2.0.1
uptime_in_days:36
changes_since_last_save:3139419007
vm_stats_swappin_count:16858675
vm_stats_swappout_count:18623203
db2:keys=1807395,expires=0
3 billion operations, without any hiccup so far. memory usage
is now completely stable as well:
USER PID %CPU %MEM VSZ RSS
Tue Jan 11 09:51:50 CET 2011
redis 24044 23.0 80.3 20061956 19871576
Fri Jan 28 16:30:44 CET 2011
redis 24044 24.8 80.3 20061956 19871516
so if you can live without persistence (i.e. for a large cache),
vm might still be an interesting option.
tim
there is clearly some problem with VM + replication in 2.0 that can
lead to a crash.
I think the problem should not be present in 2.2.
And there is, with 2.2 as well, the problem that persistence is
simply too slow...
Another major problem remains the fact that it is not an option when
the problem is having a lot of keys, as keys can't be swapped.
Hopefully diskstore can fix all these issues. For instance, with
diskstore BGSAVE already works very well, and uses a thread. But
things should seriously improve once we provide the B-tree storage
option. Today I just finished the on-disk allocator that is the base
for my tree implementation, so things should start moving faster now
on all of this.
Cheers,
Salvatore
used_memory:6442442048
used_memory_human:6.00G
this is consistent with our configuration:
vm-max-memory 6gb
tim
instead, we have a cronjob running that dumps recently modified
hashes to disk, storing them as yaml files.
if redis should crash, we'd import all the yaml files into a
new redis instance (takes around 30 minutes) and we'd lose about
10 minutes of unsaved changes.
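the cronjob itself is nothing fancy -- roughly this, sketched in
python (ours is ruby, and names / paths are invented):

import os, yaml, redis     # PyYAML + redis-py

r = redis.Redis()
DUMP_DIR = "/data/yaml"                      # hypothetical dump directory

def dump_modified(keys):
    # called every 10 minutes with the hashes modified since the last run
    for key in keys:
        data = {f.decode(): v.decode() for f, v in r.hgetall(key).items()}
        with open(os.path.join(DUMP_DIR, key + ".yml"), "w") as fh:
            yaml.safe_dump(data, fh)

def restore_all():
    # after a crash: re-import every yaml file into a fresh instance
    for name in os.listdir(DUMP_DIR):
        with open(os.path.join(DUMP_DIR, name)) as fh:
            for field, value in yaml.safe_load(fh).items():
                r.hset(name[:-len(".yml")], field, value)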
we hope to replace this homegrown diskstore precursor with the
real thing, eventually :)
tim
yes, fast local disks in a raid1 setup.
> Why aren't you using AOF instead of yaml files?
AOF wouldn't work in our case (see first message of this thread).
however, you could consider the yaml files as an "in-place" AOF
that never needs to be compacted.
tim