Plans for Virtual Memory

Salvatore Sanfilippo

Dec 30, 2009, 1:40:01 PM
to Redis DB
Hello!

Now that Redis has BLPOP (and MULTI/EXEC, even though that was not in
the TODO list), the next big thing to do is to implement Virtual
Memory. In theory the TODO list puts the implementation of Hashes
before the VM thing, but I think it's better to do it the other way
around and start with VM: I'll probably end up rewriting most of the
memory->disk serialization code for the existing types, so implementing
VM first and then modeling the new type on the new interface seems the
best thing to do.

So, basically, starting from the first days of the new year I'll
actively work on Virtual Memory, and I want to share my plan with
you.

The first thing I want to get working is a blocking implementation of
VM. This means that Redis will work as usual, but will have a new
layer for accessing keys that understands whether a key is in
memory or swapped out on disk: when Redis tries to access an on-disk
key, it will block while loading the key from disk into memory (this
includes not only I/O, but also the CPU time needed to convert the
serialized object into the in-memory representation).
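
A minimal sketch of what that access layer could look like (a sketch
against Redis-style internals: valueIsSwapped() and loadValueFromSwap()
are invented names, and dictFetchValue() just stands in for a lookup in
the Redis dict):

/* Every key access funnels through one function that faults
 * swapped-out values back into memory before returning them. */
robj *lookupKeyBlocking(redisDb *db, robj *key) {
    robj *val = dictFetchValue(db->dict, key->ptr);
    if (val != NULL && valueIsSwapped(val)) {
        /* The blocking step: disk I/O plus the CPU work of rebuilding
         * the in-memory representation from the serialized form. */
        val = loadValueFromSwap(db, key);
    }
    return val;
}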

On the other hand, in the main event loop, Redis will test from time
to time whether too much memory is used (according to a VM parameter in
the config file). If this is the case, Redis will select a few
random keys and swap the most convenient ones to disk, using an aging
algorithm so that rarely used keys are more likely to be swapped (but
Redis will also consider the size of the contained value, so among keys
with the same last access time it will prefer to swap bigger values).
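
As a toy model of that policy (the exact formula is just one reasonable
guess, not a committed design):

#include <math.h>
#include <time.h>

typedef struct sampledKey {
    time_t last_access;     /* when the key was last touched */
    size_t serialized_len;  /* estimated size of the value on disk */
} sampledKey;

/* Higher score == better candidate for swapping out. log() dampens
 * the size term, so idle time dominates and size mostly breaks ties
 * between keys of equal age. */
double swappability(const sampledKey *k, time_t now) {
    double age = (double)(now - k->last_access);
    return age * log(1.0 + (double)k->serialized_len);
}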

While implementing all this is not trivial, as there are a lot of
details to get right, it's not too hard either, because everything is
blocking in this design: no synchronization is needed, and no threads.
But this design is slow... by design :) While Redis is blocked loading
some key from disk into memory, or writing one the other way, other
clients will not be able to access keys that are already in memory.
This will seriously limit performance. Fortunately this is just a first
step that will never be released as stable; it's only there to avoid
doing all the work in a single pass.

Step 2: non-blocking VM

In step 2 we need to block only the clients dealing with
swapped-out keys, and at the same time to swap out keys in the
background when we are low on memory.
There are two solutions I'm currently considering to turn the initial
design into a non-blocking one:

a) Turn Redis into a threaded server, so that every client will be
served by a different thread. This will make VM simpler, as we can
block without worrying, but every resource requires proper locking, and
my fear is that we may end up with a Redis that is actually slower than
the current one (unless there are a number of slow operations running
on the instance, like SORT, which is not a common usage). I'm not
discarding this solution, but I don't like it too much.

b) Use the current single thread design, with the synchronous block,
but add a new simple layer that does the following: when a client
issues a new command, before processing it we check the argument
vector for keys that are needed but swapped out. If there are any, we
suspend the client (and we now have code to do this, thanks to the
BLPOP implementation) and start a thread to load the keys in the
background: this requires very little locking, as the thread only
needs to access the swap file, read the data, and create brand
new Redis objects to rebuild the data structure. When the thread is
done with its work it notifies Redis, which decrements a counter
associated with the blocked client. When all the swapped-out keys are
loaded, the client is "restarted" (as already happens with BLPOP).
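
A toy model of that bookkeeping (invented names, not actual Redis
code):

/* Each suspended client counts the argument keys still being loaded
 * by background threads; the event loop resumes it at zero, exactly
 * like an unblocked BLPOP client. */
typedef struct blockedClient {
    int keys_loading;   /* swapped-out argument keys still in flight */
} blockedClient;

/* Runs in the main thread, once per "key ready" notification that a
 * loader thread writes to the event-loop pipe. */
void onKeyLoaded(blockedClient *c) {
    if (--c->keys_loading == 0) {
        /* all needed keys are in memory: re-run the queued command */
        /* resumeClient(c); */
    }
}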

The same will happen for transferring objects from memory to disk:
we'll use another thread and mark the key in a special way. If there
is a need to access the key while the thread is moving the value to
disk, we just kill the thread and mark the value as in-memory again.

What's interesting about design "b" is that it can be used for other
stuff too, for example for CPU-intensive commands like SORT.

Ok, I think this is almost everything I'm thinking about Redis VM
currently. Any comments are welcome.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"Once you have something that grows faster than education grows,
you’re always going to get a pop culture.", Alan Kay

Sergey Shepelev

Dec 31, 2009, 4:42:54 AM
to redi...@googlegroups.com

You could split Redis into two processes: one serves clients, the
second serves the VM, blocking on disk I/O, and with some kind of API
you make it read and write the swap file. You would also need some
kind of IPC to transfer data efficiently, like mmap or sendfile :)

Salvatore Sanfilippo

Dec 31, 2009, 5:59:42 AM
to redi...@googlegroups.com
On Thu, Dec 31, 2009 at 10:42 AM, Sergey Shepelev <tem...@gmail.com> wrote:

> You could split Redis into two processes: one serves clients, the
> second serves the VM, blocking on disk I/O, and with some kind of API
> you make it read and write the swap file. You would also need some
> kind of IPC to transfer data efficiently, like mmap or sendfile :)

Hello Sergey,

this was actually my first design, but unfortunately it does not work
well, because it takes a lot of CPU time to serialize/deserialize data,
and in the two-process design this happens in the "main" process
(the one serving clients).

Not only that: many high-end servers are not used at their max if
there is only a single read or write operation at a time, as what looks
like a single logical file system can be split across different disks
and so forth.

With the threaded design the reading/writing thread can not only
handle the I/O but also convert Redis objects to/from the disk
serialization format, and this does not require locking. When objects
are transferred to disk we can simply increment their reference
count before starting the thread, which guarantees that no one
will destroy the object (and Redis objects are immutable, so no one
can change these objects either). When instead the thread needs to
read an object from disk, it can use the object allocation functions,
which are thread safe (the only thing that's not thread safe is that in
zmalloc.c we increment an integer to track the sum of all allocated
memory, but this requires only trivial locking).
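
In code the handoff could look roughly like this (incrRefCount and
decrRefCount are real Redis functions and robj is the Redis object
type; spawnIOThread and the job function are invented glue):

void swapWriteJob(robj *o);

/* Main thread: pin the object, then hand it to a writer thread.
 * Immutability plus the pin means no lock is needed. */
void swapOutInBackground(robj *o) {
    incrRefCount(o);                 /* can't be freed under the thread */
    spawnIOThread(swapWriteJob, o);  /* invented helper */
}

/* Writer thread: serialize into the swap file, then drop the pin. */
void swapWriteJob(robj *o) {
    /* ... write the serialized form of o into the swap file ... */
    decrRefCount(o);                 /* main thread may now free it */
}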

fmardini

Dec 31, 2009, 8:38:16 AM
to Redis DB
All of this sounds very interesting :)

Poul-Henning Kamp (the designer of the Varnish HTTP cache) has a very
good discussion of virtual memory at http://varnish.projects.linpro.no/wiki/ArchitectNotes
It basically says that it's the kernel's job to worry about memory.
The approach described there might work with Redis, unless the manual
VM management is also meant to deal with persistence.
Ideas from BLPOP's implementation could be used to block a client
until the data is read from disk.

Thanks Salvatore and in bocca al lupo :)

Fouad Mardini


Aníbal Rojas

Dec 31, 2009, 11:47:29 AM
to redi...@googlegroups.com
Salvatore,

I am not a low-level programmer, but here are my five cents.

   Considering b), I would love to have "slow" operations implemented
in a non-blocking way so other clients can be served, slow including
not only recovering data from disk but also SORT, etc., because in the
end this would mean the first version of a multi-threaded Redis; please
correct me if I am wrong.

   Also, wouldn't it be better to have a configurable number of "slow
operation processes" ready to handle this kind of request?

Happy New Year!

--
Aníbal Rojas
Ruby on Rails Web Developer
http://www.google.com/profiles/anibalrojas

Salvatore Sanfilippo

Dec 31, 2009, 11:50:42 AM
to redi...@googlegroups.com
On Thu, Dec 31, 2009 at 2:38 PM, fmardini <f.ma...@gmail.com> wrote:
> All of this sounds very interesting :)
>
> Poul-Henning Kamp (the designer of the Varnish HTTP cache) has a very
> good discussion of virtual memory at http://varnish.projects.linpro.no/wiki/ArchitectNotes
> It basically says that it's the kernel's job to worry about memory.

Hello, in the Redis FAQ there is an entry about why this does not work
with Redis.
In the FAQ you'll find the details, but the short story is:

a) a single memory page is likely to hold Redis objects belonging to
many different values. So maybe only 10% of the whole 4096 bytes of a
page is actually in use, but the whole page will never get swapped.
b) while an HTML document is the same size whether stored on disk or in
memory, a Redis object like a list of 10 small elements is about 10
times smaller when stored on disk. This means Redis VM does 1/10 of the
I/O required by the kernel VM, which is not usable anyway because of 'a'.
c) (not in the FAQ) the kernel VM works in a blocking fashion. When
there is a page fault because a memory page is on disk, the process
blocks until the kernel brings the page back into memory. Not
good for a single-threaded design.

> The approach described there might work with Redis, unless the manual
> VM management is also meant to deal with persistence

No, persistence will be a completely different matter: it will actually
not change at all, except for a minor thing: Redis will "annotate" in
snapshots which objects are on disk, so that on DB reload there will
not be a "slow start" effect. This can't be done with the append-only
file unfortunately, even if it's not impossible to use some form of
"annotation" when rewriting the log with BGREWRITEAOF.

> Ideas from BLPOP's implementation could be used to block a client
> until the data is read from disk

Indeed, this is what I'll use when I switch to a threaded
implementation. But it is entirely unneeded with your proposed
kernel-VM solution, as there it's transparently handled by the kernel.

> Thanks Salvatore and in bocca al lupo :)

Thanks!

Ciao,
Salvatore

--
Salvatore 'antirez' Sanfilippo

Salvatore Sanfilippo

Dec 31, 2009, 11:56:40 AM
to redi...@googlegroups.com
2009/12/31 Aníbal Rojas <aniba...@gmail.com>:
> Salvatore,

Hello Aníbal

>   Considering b), I would love to have "slow" operations implemented
> in a non-blocking way so other clients can be served, slow including
> not only recovering data from disk but also SORT, etc., because in the
> end this would mean the first version of a multi-threaded Redis; please
> correct me if I am wrong.

Indeed, that's entirely possible with almost everything *but*
"SORT BY", which is where it would be most useful :)
This is because it's simple to start a thread to perform some kind of
operation on a given value, but BY will try to fetch data from the
Redis hash table while working, so it is much more complex. Still,
there are tricks, like performing a blocking part in the main process
(the part that copies data + scores to a vector; this is how it works
already), and then performing the actual sorting in another thread.

No problems with SORT without BY / GET.
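
A self-contained sketch of that copy-then-sort trick (the notification
back to the event loop is left as a comment):

#include <pthread.h>
#include <stdlib.h>

typedef struct sortRow { double score; void *value; } sortRow;
typedef struct sortJob { sortRow *rows; size_t n; } sortJob;

static int cmpRows(const void *a, const void *b) {
    double x = ((const sortRow *)a)->score;
    double y = ((const sortRow *)b)->score;
    return (x > y) - (x < y);
}

/* Worker thread: sorts a private vector copied out by the main
 * thread, touching no shared Redis state, so no locks are needed. */
static void *sortThread(void *arg) {
    sortJob *job = arg;
    qsort(job->rows, job->n, sizeof(sortRow), cmpRows);
    /* ... notify the event loop so the reply can be built ... */
    return NULL;
}

/* Called from the main loop (error handling omitted). */
void startBackgroundSort(sortJob *job) {
    pthread_t tid;
    pthread_create(&tid, NULL, sortThread, job);
    pthread_detach(tid);
}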

>    Also, wouldn't it be better to have a configurable number of "slow
> operation processes" ready to handle this kind of request?

Yes, we'll have this option: the max number of threads performing the
I/O operations. I'm not sure the thread performing the swap-out will be
just one, btw (but more swap-in threads can be allowed). I have to
think about these details more, as it's a "stage two" problem :)

>    Happy New Year!

Thanks, Happy New Year!

Aníbal Rojas

Dec 31, 2009, 12:19:00 PM
to redi...@googlegroups.com
Salvatore,

>>   Considering b), I would love to have "slow" operations implemented
>> in a non-blocking way so other clients can be served, slow including
>> not only recovering data from disk but also SORT, etc., because in the
>> end this would mean the first version of a multi-threaded Redis; please
>> correct me if I am wrong.
>
> Indeed, that's entirely possible with almost everything *but*
> "SORT BY", which is where it would be most useful :)

Now we have ZSETs :) but SORT, SINTER, SUNION, etc. will be used more
and more as people model more complex problems with Redis.

In my tests the only way to slow down Redis was to throw a lot of slow
operations (SORTs) at it, so the benefits are obvious.

> This is because it's simple to start a thread to perform some kind of
> operation on a given value, but BY will try to fetch data from the
> Redis hash table while working, so it is much more complex. Still,
> there are tricks, like performing a blocking part in the main process
> (the part that copies data + scores to a vector; this is how it works
> already), and then performing the actual sorting in another thread.

Yes, I was thinking that SORT has some unique... features that will
require special care :)

>>    Also, wouldn't it be better to have a configurable number of "slow
>> operation processes" ready to handle this kind of request?
>
> Yes, we'll have this option: the max number of threads performing the
> I/O operations. I'm not sure the thread performing the swap-out will be
> just one, btw (but more swap-in threads can be allowed). I have to
> think about these details more, as it's a "stage two" problem :)

It makes sense to think of a single swap-out thread: the only
"client" of this process is the Redis server itself. But the swap-in
process "serves" multiple clients.

Turn off the computer, go and get something to drink, and celebrate.

--
Aníbal

Steve Farrell

Dec 31, 2009, 12:45:07 PM
to redi...@googlegroups.com
Salvatore,

I wonder if you could explain a little bit more about the use cases for virtual memory.  You mentioned it would enable certain kinds of applications that are not currently available... what do you have in mind?

Here's how I'm looking at it: redis is fantastic for frequently-accessed, frequently-changing data.  You want that stuff to be in memory 100% of the time - if it gets swapped at all, you probably need to revise your system architecture.

Virtual memory is a great technique for desktop operating systems because it allows users to run more applications than they can fit in RAM, and only pay a price when they switch between them.  As long as users do this infrequently, the net result is positive... and since memory is expensive, it's really the only practical option.  On servers though... my sense is that if I have a server swapping, I've got a problem to fix.  Virtual memory on a server is almost more of a canary in a coal mine - something to tell you gently that you've got a big problem.

The point is this: if I have infrequently accessed data, then why would I store it in redis?  If I have frequently accessed data that doesn't fit in memory, I need more memory or I need to re-architect my application.

Steve Farrell

Dec 31, 2009, 12:47:57 PM
to redi...@googlegroups.com
By the way, I don't mean to draw too close a parallel to OS virtual memory - I realize you have major application-specific optimizations.  I was just using that as an analogy...

Luke Melia

Dec 31, 2009, 1:07:29 PM
to redi...@googlegroups.com
On Dec 31, 2009, at 12:45 PM, Steve Farrell wrote:

> I wonder if you could explain a little bit more about the use cases
> for virtual memory. You mentioned it would enable certain kinds of
> applications that are not currently available... what do you have in
> mind?

I can share a use case that has me excited about this feature: I work
on a social networking site that has Facebook-style activity feeds.
When an activity of interest occurs on the site, we create an Activity
object in MySQL, cache its HTML, and then distribute the ID of the
activity to a Redis list for all the users who may find it
interesting. Redis' list data structure is great for this.
Distribution is fast and easy. Trimming the list to a max size during
distribution is easy. And this architecture lets us render a feed very
quickly when a user visits the site.

Right now, however, we are keeping lists in memory even for users who
rarely access the site and have few social connections. My hope is
that virtual memory in Redis will allow this architecture to scale,
with memory requirements that match the "frequently accessed data"
portion better than they do now.

Cheers,
Luke
--
Luke Melia
lu...@lukemelia.com
http://www.lukemelia.com/

paulino huerta

Jan 1, 2010, 6:42:36 AM
to redi...@googlegroups.com
Hello Salvatore, and thanks for sharing.

I am going to use Redis from my apps, and I need my dataset to fit in
memory, which is the reason I chose Redis :)
I have 200 keys and I need them to always fit in memory.
I think the server must take care of this situation, maybe through a
configuration parameter.
The point is to be able to establish a kind of priority in the
cache/disk-swap storage.

Cheers
--Paulino

András Bártházi

Jan 1, 2010, 7:11:25 AM
to redi...@googlegroups.com
Hi,

> a) Turn Redis into a threaded server, so that every client will be

I believe this is a bad idea. The reason I like nginx and nodejs is
that they're not threaded.

> b) Use the current single thread design, with the synchronous block,
> but add a new simple layer that does the following: when a client

This version sounds cool to me.

Happy new year and other holidays,
Andras

Salvatore Sanfilippo

Jan 1, 2010, 7:13:31 AM
to redi...@googlegroups.com
On Thu, Dec 31, 2009 at 6:45 PM, Steve Farrell <st...@farrell.org> wrote:
> Salvatore,
> I wonder if you could explain a little bit more about the use cases for
> virtual memory.  You mentioned it would enable certain kinds of applications
> that are not currently available... what do you have in mind?

Hello Steve!

Redis virtual memory has two goals:

1) If the dataset access pattern is not random, but there is a bias
towards a subset of keys (let's call this subset the "hot
spot"), with VM Redis can deliver performance similar to the case
where you have only the hot spot in memory, while using only the memory
required to hold the hot spot.

You asked: if I know some data is not needed, why bother putting it
into Redis? Because you don't know what data will not be used.
Actually it's harder than that: maybe all the data gets accessed at
some point, but with *very* different frequencies. Or, even more
interesting, the hot spot changes with time: today it's one subset of
keys; wait two weeks and it will be a different subset.

Under all these conditions Redis will perform nearly as well as if the
whole dataset fit in memory, thanks to VM.

2) Your hot spot is much bigger than your RAM, but you are willing to
pay a performance penalty because you want to use Redis.

> Here's how I'm looking at it: redis is fantastic for frequently-accessed,
> frequently-changing data.  You want that stuff to be in memory 100% of the
> time - if it gets swapped at all, you probably need to revise your system
> architecture.

point number 1 above gives you this even when it looks impossible :)

> Virtual memory is a great technique for desktop operating systems because it
> allows users to run more applications than they can fit in RAM, and only pay
> a price when they switch between them.  As long as users do this
> infrequently, the net result is positive... and since memory is expensive,
> it's really the only practical option.  On servers though... my sense is
> that if I have a server swapping, I've got a problem to fix.  Virtual memory

Indeed, if you have frequent swapping on a production server there is
a problem to fix, but that's because your processes usually are not
designed to use the whole RAM as a huge cache, like Redis does with
VM.
Also, Redis VM is much different from OS VM, as it's very specific: it's
like a precise garbage collector versus a conservative garbage
collector. Redis can "see" which values to swap and whether it's a good
idea. The OS VM works with very low-level heuristics, at hardware
page level.

> on a server is almost more of a canary in a coal mine - something to tell
> you gently that you've got a big problem.
> The point is this: if I have infrequently accessed data, then why would I
> store it in redis?  If I have frequently accessed data that doesn't fit in
> memory, I need more memory or I need to re-architect my application.

Ok, assume you keep some important user metadata in Redis. Your
service grows a lot, and this data no longer fits in memory. Still, most
of these users access your site once a year, and most will *never*
access your site again.

Your solution is to re-architect your application: you take a MySQL
server and store this rarely used metadata in MySQL instead of keeping
it in Redis. If somebody asks for it, you transfer it from
MySQL to Redis.
This is a memcached-alike usage. What you actually did is implement
a slow, application-level Virtual Memory for Redis :)

Redis with VM is like MySQL + Redis where Redis just acts as a cache,
but in a truly transparent way.

Cheers,
Salvatore

Salvatore Sanfilippo

Jan 1, 2010, 7:20:21 AM
to redi...@googlegroups.com
On Fri, Jan 1, 2010 at 12:42 PM, paulino huerta <paulin...@gmail.com> wrote:
> Hello Salvatore, and thanks for sharing.
>
> I am going to use Redis from my apps, and I need my dataset to fit in
> memory, which is the reason I chose Redis :)
> I have 200 keys and I need them to always fit in memory.
> I think the server must take care of this situation, maybe through a
> configuration parameter.
> The point is to be able to establish a kind of priority in the
> cache/disk-swap storage.

Hello Paulino,

no problem! A default Redis install will work exactly like today,
without VM. And it will be as fast :)
To enable VM you set a number of configuration parameters, like this:

vm-enabled yes
vm-pagesize 256
vm-pages 100000
vm-max-ram 2g

This tells Redis to enable VM, to use it only if the dataset does not
fit in "vm-max-ram" bytes (2 GB in the example), to use a page size of
256 bytes (a good idea if the rarely used values will be small), and to
use 100000 swap pages (100000 pages x 256 bytes each, i.e. a ~24 MB
swap file).

Steve Farrell

Jan 2, 2010, 2:52:03 PM
to redi...@googlegroups.com
On Jan 1, 2010, at 4:13 AM, Salvatore Sanfilippo <ant...@gmail.com>
wrote:

> Your solution is to re-architect your application: you take a MySQL
> server and store this rarely used metadata in MySQL instead of keeping
> it in Redis. If somebody asks for it, you transfer it from
> MySQL to Redis.
> This is a memcached-alike usage. What you actually did is implement
> a slow, application-level Virtual Memory for Redis :)
>
>

Yes, that makes sense.... the question is whether the application
designer can write to redis alone, or whether he has to use redis as a
kind of intermediary for some data -- hence your comparison with
memcache.

Clearly, if he can write to redis alone, that's a major convenience for
the designer. However, that convenience is only worthwhile if he can
get results that are just as good. I think that's the part I'm trying to
understand...

Let's take the example of retwis (which I think covers the use case
Luke mentioned)... so my first question is this: the problem with
retwis is that it'll run out of memory pretty quickly, as each new
message gets stored in a redis list in memory. Redis virtual memory
would seem to help here... so when you do lists using virtual memory,
will you actually store the list in different segments on disk? In
other words, will it be possible to push data onto the list without
having to materialize the whole thing? Likewise for paging? (Sounds
doable and useful...). Likewise for zsets, I assume?

If you do that, what's the disk storage arrangement you have in mind?
Linked lists work nicely in memory where fragmentation isn't too
expensive, but on disk it's a bigger issue.

Another question I have is also about physical storage layout. Ok, so
let's say in retwis you decide to do a memory optimization: strip off
the message content and store it in a set. This way when a message is
delivered to, say, 100k followers, you don't store 100k copies of the
message text. (I assume this makes sense and is a good idea - let me
know if you don't think so.) Now, over time this set of messages will
get very large... which could be ok... but what worries me is that if
you read a page of a timeline, the messages that will need to be
dereferenced will be scattered all over the disk, and you'll end up
having to page in the entire set, which might not even fit into
memory. (Well, thinking out loud: if you used a zset ordered by
timestamp, and you expected temporal locality, you might be able to
solve this problem...)

Salvatore Sanfilippo

Jan 2, 2010, 4:00:43 PM
to redi...@googlegroups.com
On Sat, Jan 2, 2010 at 8:52 PM, Steve Farrell <st...@farrell.org> wrote:

> Yes, that makes sense.... the question is whether the application
> designer can write to redis alone, or whether he has to use redis as a
> kind of intermediary for some data -- hence your comparison with
> memcache.

Of course, given the nature of Redis you can use it as a memcached on
steroids, but this is just one of the Redis use cases, which are
actually three:

a) memcached ++
b) messaging system
c) database

Virtual Memory is about "c": the way it is conceived will make it
similar to what you get from MySQL + memcached, because you can store
more data than you have memory, while keeping good performance on the
hot spot of the data.
But unlike a DB + memcached, your application is not forced to
implement caching logic: everything happens inside Redis.

The append-only log + VM + redis-cluster are all steps towards making
the "c" solution strong.

> Let's take the example of retwis (which I think covers the use case
> Luke mentioned)... so my first question is this: the problem with
> retwis is that it'll run out of memory pretty quickly, as each new
> message gets stored in a redis list in memory.  Redis virtual memory
> would seem to help here...  so when you do lists using virtual memory,
> will you actually store the list in different segments on disk?  In

No, a single Redis object will be stored in a contiguous part of the
Redis swap file.

> other words, will it be possible to push data onto the list without
> having to materialize the whole thing?  Likewise for paging?  (Sounds
> doable and useful...).  Likewise for zsets, I assume?

If you push data onto a list, and the list is on disk, it will be
reloaded into memory: the whole list.
Virtual Memory is about swapping out objects that are rarely used.

So what happens with the Retwis design?

1) Redis will be able to transfer a lot of sets to disk, for instance
the "followers" sets of inactive users.
2) Redis will be able to transfer old messages to disk, both for active
and inactive users, as messages are registered as
one-message-per-key, and we push just IDs onto lists.

User timelines instead will be stressed even if the users are not
active, as long as they have active followers. How do we fix this
problem?

A good idea is to segment the data into lists of 50 elements each. Old
"segments" will be swapped to disk.

Every kind of application will be helped by VM, but if the designer is
*aware* of VM he can think of a data layout that helps VM a lot.

> If you do that, what's the disk storage arrangement you have in mind?
> Linked lists work nicely in memory where fragmentation isn't too
> expensive, but on disk it's a bigger issue.

This will not be a problem: as I already said, every object will be
serialized as a whole, not editable on disk but only in memory.

> Another question I have is also about physical storage layout.  Ok, so
> let's say in retwis you decide to do a memory optimization: strip off
> the message content and store it in a set.  This way when a message is
> delivered to, say, 100k followers, you don't store 100k copies of the
> message text.  (I assume this makes sense and is a good idea - let me

It already works this way :)

> know if you don't think so.)  Now, over time this set of messages will
> get very large... which could be ok... but what worries me is that if
> you read a page of a timeline, the messages that will need to be
> dereferenced will be scattered all over the disk, and you'll end up
> having to page in the entire set, which might not even fit into
> memory.  (Well, thinking out loud: if you used a zset ordered by
> timestamp, and you expected temporal locality, you might be able to
> solve this problem...)

If you read a page of a very old timeline, you'll access 10 old
messages, which will be loaded from disk into memory. Every message is
stored in a single key.

But I get your point. What happens if I have a *huge* object stored
on disk?
Nothing special, but of course the query against this object will take a
lot of time to complete, and more memory will be used. You can set
Redis to use at most a given amount of RAM when using VM. Probably, if
your application is dealing with huge objects stored as a whole, it's
a good idea not to allocate all the RAM, but to leave a given
percentage free, so that when these big objects get loaded into memory
the kernel will not swap Redis memory pages to disk.

Given that Redis has full control over what it swaps out, it can also be
a good idea to add a parameter to the VM that sets the minimum and
maximum size of objects that can be swapped: this way we can avoid
both very small values and huge values being transferred to
disk (big values are more likely to be transferred in general, as the
algorithm will think it's a big win to transfer a 100 MB list that has
not been used for a few hours rather than a 1 MB list that has not been
used for two days).
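
Such a gate could be as simple as this (names invented; the two
thresholds would come from the config file):

/* Reject values outside the configured swappable-size window: tiny
 * values aren't worth a page, huge ones stall loads and spike RAM. */
int objectIsSwappableSize(size_t serialized_len,
                          size_t vm_min_object_size,
                          size_t vm_max_object_size) {
    return serialized_len >= vm_min_object_size &&
           serialized_len <= vm_max_object_size;
}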

Cheers,
Salvatore

Steve Farrell

Jan 2, 2010, 4:38:12 PM
to redi...@googlegroups.com
Hi Salvatore,

On Sat, Jan 2, 2010 at 1:00 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

> A good idea is to segment the data into lists of 50 elements each. Old
> "segments" will be swapped to disk.

It would be very compelling, I think, if redis vm could do this behind the scenes for lists and zsets and any other collection where it makes sense.  This way you could have a timeline with 1 million messages and still load the head of it efficiently. 
 
> But I get your point. What happens if I have a *huge* object stored
> on disk?

By "huge object", do you mean like a blob, or like a big collection?  In any event, what I was thinking of with this example is the case of fetching many small objects.  All objects will be stored, effectively, randomly on the disk, right?  I worry about the scalability of that, because if you're accessing, say, thousands of objects a second that are scattered all over the disk, the OS disk cache will get blown, and every one of those lookups is going to require a seek.  See, for example, the use of sequential primary keys to ensure temporal locality for friendfeed: http://bret.appspot.com/entry/how-friendfeed-uses-mysql

With that discussion in mind, it might make sense to consider a primary key function for redis.  It could be totally optional, of course, but if such keys are used, the vm could make an effort to lay things out on disk sequentially.  Or maybe there's a clever way to do this behind the scenes?
 

> Given that Redis has full control over what it swaps out, it can also be
> a good idea to add a parameter to the VM that sets the minimum and
> maximum size of objects that can be swapped: this way we can avoid
> both very small values and huge values being transferred to
> disk (big values are more likely to be transferred in general, as the
> algorithm will think it's a big win to transfer a 100 MB list that has
> not been used for a few hours rather than a 1 MB list that has not been
> used for two days).
 
You might not need this if you can segment collections.

I'm worried I don't understand your model well enough yet with respect to fragmentation.  Let's say you have a timeline with 1M entries on it.  It's stored on blocks 1213-1233 on disk.  Now you want to push a new entry onto it.  You peel the block off of disk, mark blocks 1213-1233 as free (right?), keep it in memory for a while, and then eventually write it back out to disk, but it might not fit, in which case now it gets written to blocks 2342-2364.  And with retwis, say there is an inactive user with 10k followers, 10% are inactive and have been swapped to disk... he tweets... now you have to peel 1k timelines off of disk, keep them in memory for a while (ie, push out a lot of other stuff), and then write them all back to new places on disk... am I missing it, or does that sound right?

Salvatore Sanfilippo

Jan 3, 2010, 7:41:23 AM
to redi...@googlegroups.com
On Sat, Jan 2, 2010 at 10:38 PM, Steve Farrell <st...@farrell.org> wrote:
> Hi Salvatore,

Hello Steve,

>> A good idea is to segment the data into lists of 50 elements each. Old
>> "segments" will be swapped to disk.
>
> It would be very compelling, I think, if redis vm could do this behind the
> scenes for lists and zsets and any other collection where it makes sense.
>  This way you could have a timeline with 1 million messages and still load
> the head of it efficiently.

I understand your point, but my feeling is that this is a premature
optimization, and also that this is not the goal of VM.
VM is not about manipulating data on disk; it's just a way to keep
in memory only the objects that are used more frequently, and to
transfer to disk the objects that are not used frequently.

Now, it's clear that your point is valid, as I could ask: what about
seeing a Hash, a List, or a sorted set as a data space itself, and
transferring rarely used "parts" of these data structures to disk?
It makes sense, but I think it's not the good direction. It is a
premature optimization, it addresses something that can be addressed
without much effort at the database-design stage, it uses the swap file
a lot less efficiently (saving a serialized data structure is very
different from saving it in a form that can be manipulated in place),
*and* I'm not convinced at all that it would lead to better performance.

>> But I get your point. What happens if I have a *huge* object stored
>> on disk?
>
> By "huge object", do you mean like a blob, or like a big collection?  In any

A huge collection.

> event, what I was thinking of with this example is the case of fetching many
> small objects.  All objects will be stored, effectively, randomly on the
> disk, right?  I worry about the scalability of that, because if you're
> accessing, say, thousands of objects a second that are scattered all over
> the disk, the OS disk cache will get blown, and every one of those lookups
> is going to require a seek.  See, for example, the use of sequential primary

If you have this access pattern, the objects you are accessing are
probably *already* in RAM.
It's not impossible to have such a strange access pattern, where a
subset of objects is not used for days and then at some point a *very
large number* of objects is accessed within a few seconds, but the best
we can do when this happens is to load the data from disk into memory,
slowly.

VM is not magic :) Its ability to work well is closely related to the
amount of RAM you have versus the bias of your data access. If you have
evenly distributed accesses, VM is entirely *useless*. If you have a
huge dataset and access just 0.01% of the keys most of the time, VM is
spectacularly useful. In the real world most of the time you are
somewhere between these two extremes.

> keys to ensure temporal locality for
> friendfeed: http://bret.appspot.com/entry/how-friendfeed-uses-mysql
> With that discussion in mind, it might make sense to consider a primary key
> function for redis.  It could be totally optional, of course, but if such
> keys are used, the vm could make an effort to lay things out on disk
> sequentially.  Or maybe there's a clever way to do this behind the scenes?

Our design is simpler, and the good designer will keep in mind how VM
works and will try to design a data model that works very well
with it. Faster, less Redis code, simpler, more control in the hands
of the developer. I don't claim this is the best we can do; it's just
my vision of Redis, and I'll try to control complexity in any way
needed.

> You might not need this if you can segment collections.
> I'm worried I don't understand your model well enough yet with respect to
> fragmentation.  Let's say you have a timeline with 1M entries on it.  It's
> stored on blocks 1213-1233 on disk.  Now you want to push a new entry onto
> it.  You peel the block off of disk, mark blocks 1213-1233 as free (right?),

Ok, I'm following you, but why should this happen in the first place
if the data layout is designed well?
If you have user timelines, they are trimmed to 500 or maybe 1000
elements, not 1M. If you have 1M entries in a list, then this list is
about something important that gets used anyway, like the *global*
Retwis timeline, and it will never get swapped, as it is constantly
accessed.

Btw, let's imagine there is a problem in the design and users have
infinite-length timelines, and these timelines are not "segmented"
(hint: segmenting a timeline in the background is trivial using
RPOPLPUSH; atomically, a background process can trivially split a
huge timeline into two pieces, one with N items and one with all the
rest. At the application level it is trivial to implement pagination
over the two lists, as sketched below).
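
For illustration, a sketch of such a background segmenter using the
hiredis C client (the key names and the "keep" threshold are made up;
RPOPLPUSH is the only Redis primitive relied on):

#include <hiredis/hiredis.h>

/* Atomically move the oldest entries of a huge "live" list onto an
 * "old" segment until only keep items remain. With LPUSH-style
 * timelines the tail holds the oldest item; RPOPLPUSH pops it and
 * pushes it onto the head of the destination, so the old segment
 * stays in newest-first order and pagination can simply walk the
 * live list first and then the old one. */
void segmentTimeline(redisContext *c, const char *live,
                     const char *old, long long keep) {
    long long len = 0;
    redisReply *r = redisCommand(c, "LLEN %s", live);
    if (r && r->type == REDIS_REPLY_INTEGER) len = r->integer;
    if (r) freeReplyObject(r);
    while (len-- > keep) {
        r = redisCommand(c, "RPOPLPUSH %s %s", live, old);
        if (r) freeReplyObject(r);
    }
}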

> keep it in memory for a while, and then eventually write it back out to
> disk, but it might not fit, in which case now it gets written to blocks
> 2342-2364. And with retwis, say there is an inactive user with 10k

The idea currently is to only write objects that can be stored in
contiguous blocks. I think at least in the first stage I'll not
support fragmentation at the object level (we'll still have
fragmentation at the whole-swap-file level, as contiguous empty pages
can be hard to find, but I'll use algorithms similar to the ones used
by file systems to avoid too much fragmentation).
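
The kind of search involved might look like this naive first-fit scan
over a free-page bitmap (real allocators keep extra indexes to avoid
rescanning):

/* vm_pages_used[i] != 0 means swap page i is taken. Returns the index
 * of the first page of a free run of count pages, or -1 if none. */
long findContiguousPages(const unsigned char *vm_pages_used,
                         long vm_pages, long count) {
    long run = 0;
    for (long i = 0; i < vm_pages; i++) {
        run = vm_pages_used[i] ? 0 : run + 1;
        if (run == count) return i - count + 1;
    }
    return -1;  /* no contiguous block: caller must wait or fall back */
}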

> followers, 10% are inactive and have been swapped to disk... he tweets...
> now you have to peel 1k timelines off of disk, keep them in memory for a
> while (ie, push out a lot of other stuff), and then write them all back to
> new places on disk... am I missing it, or does that sound right?

Yes, if you push data into lists that are on disk, what happens is
exactly this: Redis will load them from disk, one each time you LPUSH.
These timelines will be brought into memory, so if the user tweets
again within a short time they will not be swapped to disk at all. If
the user does not tweet for a while (and neither does anyone else the
affected users follow), these lists will be transferred to disk again,
*if* we are using too much RAM.

I think that what you are missing here is that this is not a problem
with Redis VM. The problem is rather that most of these lists might not
get swapped to disk at all. But this is per-design, and I'm sure
we'll be smart enough to find the right solutions for these problems.

The first step is to release a synchronous implementation of VM. Given
that it will be blocking, it will be even simpler to see when a data
layout is not scaling well because of VM. We'll develop a small
culture about how to use VM efficiently, just as we are already
developing a culture about Redis patterns. If the design turns out to
be flawed in the real world, I'll reconsider it and code a different
solution; otherwise I'll go ahead and implement the non-blocking
version of VM.

Thanks for sharing your ideas about VM. It's very interesting to think
about this stuff from a different point of view.

For instance, we found out how important RPOPLPUSH, conceived for a
totally different purpose, will be in the context of VM.

Cheers,
Salvatore

Steve Farrell

Jan 3, 2010, 4:05:51 PM
to redi...@googlegroups.com
Hi Salvatore,

It sounds like we both agree that, in the case of a twitter-style timeline, keeping data from last week or last year in memory, simply in order to push a new update to the top of the timeline, would be very wasteful.  I think the question, though, is whether the application designer or redis vm should be responsible for segmenting the timeline.  Clearly it's not too much work from the application designer's point of view, and that may well be the best answer... but I have to say I don't understand why yet... the two seem to be equivalent from a disk utilization and performance point of view.

As for the temporal-locality question... of course I understand that vm is not magic, but I think the point I'm making is that while all disk access is slow, some layouts are slower than others, and that can make a big difference when dealing with huge amounts of data.

PS: I hope this discussion is productive for you and others... I don't mean to distract you from working on redis, or waste anyone's time.  I'm frankly trying to think through whether it makes sense to use redis to store data that is not actively being accessed and whether, in doing so, it would obviate a secondary disk-based storage component of my application.  It would be profoundly valuable if redis could store unlimited amounts of data in a way that's efficient for many use cases... but it would be a very surprising outcome if it could do so without reproducing some of the work that's gone into other data management systems in terms of data layout on disk.  You may be right, of course... I'm still trying to wrap my head around that possibility...

Cheers


mmalex

Jan 5, 2010, 3:56:36 PM
to Redis DB
hello all. just found this thread, had something to say about vm.
There is a story behind why I am posting this, which I have put at the
end of this post for those of you in a more reading-y mood :)

> Hello, in the Redis FAQ there is an entry about why this does not work
> with Redis.
> In the FAQ you'll find the details, but the short story is:
>
> a) a single memory page is likely to hold Redis objects belonging to
> many different values. So maybe only 10% of the whole 4096 bytes of a
> page is actually in use, but the whole page will never get swapped.

this is interesting. but how about an alternative approach to VM that
fixes this problem? something I have been considering for my own
key-value store (developed over the last year or so but sadly closed
source, see below) is to basically not do VM directly but instead
solve the problem you give above in the faq: for 'stale' data, instead
of writing it to disk, serialise it into compact 'VM friendly' pages
which you know the VM system will swap out. in other words, compact
and compress your own memory image as you go, putting more CPU cycles
into squashing cold data, leaving hot data 'unserialised', and letting
the OS VM system swap out the cold pages. I don't know enough about
how clever the linux VM is, but in the spirit of the varnish guy, it
seems at least worth considering (rough sketch below).
end of VM comment :)
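
A rough sketch of the 'compact the cold data, let the kernel swap it'
idea (all names invented):

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Cold values are serialized into page-aligned arenas that hold
 * nothing hot, so the OS VM sees whole pages go idle and can page
 * them out: this avoids the mixed hot/cold page problem from the
 * FAQ quoted above. */
typedef struct coldArena {
    unsigned char *base;   /* page-aligned backing buffer */
    size_t cap, used;
} coldArena;

coldArena *coldArenaNew(size_t pages) {
    coldArena *a = malloc(sizeof(*a));
    if (!a) return NULL;
    a->cap = pages * PAGE_SIZE;
    a->used = 0;
    if (posix_memalign((void **)&a->base, PAGE_SIZE, a->cap)) {
        free(a);
        return NULL;
    }
    return a;
}

/* Pack one serialized cold value; returns its offset, or -1 if full.
 * The arena is touched again only when a cold key is actually read. */
long coldArenaAppend(coldArena *a, const void *buf, size_t len) {
    if (a->used + len > a->cap) return -1;
    memcpy(a->base + a->used, buf, len);
    a->used += len;
    return (long)(a->used - len);
}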

the story and context:

I've been working on an in-memory persistent kv store for a year or
so, sadly closed source as I work on a PS3 project (LittleBigPlanet),
however I'm quite proud of the result and recently went to ScaleCamp
in london and gave a talk about it and compared it to redis. actually,
redis seems great and very close to my 'ideal' of how such a thing
should be made. fantastic! I am sure salvatore is a mad genius.

anyway, it was also reassuring that my independent work was somewhat
similar / validated once I stumbled on redis (as opposed to the 10000
dynamo clones). I found out about it because I had independently come
up with the idea of forking and abusing copy-on-write to speed up my
background checkpointing, but it sounded crazy, I'm not a linux guy,
and I was scared of what the overcommit issues might be. googling for
fork and overcommit and database took me to redis, and after reading
the source a few months ago I have to say it's a great piece of work
(though I haven't used it, having something similar now in-house). of
all the KV stores, I think it is the right way: you can't escape decent
data structures if you want to do 'serious' interesting queries, I
think. (my approach is a little different: I have small 'atomic' values
like 'the others', but I construct lists/sets/join etc. operations over
ranges of keys with lazily built indices, so similar result, and the
replication works differently... more like a cluster, with keys sharded
over machines and then merged during set operations, but anyway I
digress.)

in my talk at scalecamp I mentioned that the main difference in
philosophy from redis was that my focus had been on smaller memory
footprint over speed; I said 'in an in-memory database, you should
worry about memory efficiency'. I'm an embedded systems coder (actually
a GPU specialist) so I also like that kind of thing. I can post
somewhere else if people are interested in some of the techniques used,
it's nothing complex, but the result is good: I reckon I'm about 4x
slower than redis (not measured, just gut), but in return the mem
footprint is typically smaller than the dumb/raw serialised form of the
data (ie 10x better than memcached/redis).

Simon Willison asked me at the time to post here about it, maybe to
kick salvatore, because simon knew that anything said here would prompt
him to make it in redis :). 'as soon as you know it's possible, it
becomes easy' or something :) I didn't post, because I got distracted
back to GPU world, but ANYWAY I noticed you had twittered about VM and
that brought me here again via google. so I wanted to propose the idea
you see at the top of this mail.

thanks! sorry for the rambly email, I hope it was interesting and gave
some context as to where the 'compact and leave it to the VM' idea
came from.

--
alex evans
http://bluespoon.com http://www.mediamolecule.com
