datasets bigger than RAM, or why we need a hash type.


Salvatore Sanfilippo

Nov 3, 2009, 4:04:24 AM
to Redis DB
Hello,

in recent weeks I started to think more and more about datasets
bigger than the available memory. What I think is that many times the
size of the dataset gets overestimated, and actually even with 8GB of
RAM it's possible to run many kinds of webapps with tons of users.
Still it's impossible to deny that there are workloads where a
key-aging algorithm could work, and these are indeed the workloads
where having datasets bigger than RAM is useful.

For instance, imagine a Redis dataset composed of 10 million keys
containing mostly lists, with every list having an average length of
1000 elements. This is a lot of data, but if the access pattern is not
evenly distributed, and actually most of the time just 10% of the data
is accessed (your most active users, for instance), having a dataset
bigger than RAM makes sense because it's possible to keep in RAM only
the information that is more likely to be accessed. OK, it's just the
same concept as the virtual memory of operating systems, but here the
granularity is not the hardware memory page, but the key itself.

Basically Redis could have a new config directive like this:

virtualmemory <bytes>

If this is set, Redis will start using the unused bytes of the Redis
object structure to store aging information: every time a key is
accessed we set a truncated timestamp there. Aging will work even with
a three-byte timestamp (in seconds this is equivalent to 194 days:
(2^24)/(3600*24) = 194). Basically it starts to be less accurate about
what page to swap when there are *many* pages not read or written for
more than 194 days, but there is a simple algorithm to fix this as
well: when testing for keys to swap out, if we find keys that are near
to overflow, just fix this by adding for instance 60 days; the key is
old anyway and will likely be swapped.

OK, now that there is aging information, every time Redis starts to
run low on memory a few keys will be flushed to disk. Actually the key
will remain in memory; it's the *value* that gets swapped out. How are
these keys selected? By age and size. A few will be tested, and the
one with the best combination of age and size will be swapped out (I'm
not sure about the best algorithm, but I guess that Swappability =
Size*Age is a decent start).
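A sketch of the sampling (reusing vmObject and objectAge() from the
sketch above; randomObject() and serializedSize() are hypothetical
helpers, not real Redis functions):

#include <stddef.h>

#define SWAP_SAMPLE 5                  /* how many random keys to test */

vmObject *randomObject(void);          /* hypothetical helper */
size_t serializedSize(vmObject *o);    /* hypothetical helper */

static vmObject *selectObjectToSwap(void) {
    vmObject *best = NULL;
    double bestScore = -1;
    for (int i = 0; i < SWAP_SAMPLE; i++) {
        vmObject *o = randomObject();
        /* Swappability = Size * Age: big and idle swaps out first. */
        double score = (double)serializedSize(o) * (double)objectAge(o);
        if (score > bestScore) { best = o; bestScore = score; }
    }
    return best;
}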

When the user tries to access the key, it will get reloaded from disk
to memory. We already have methods Redis calls every time a key is
accessed, so few changes are required to make this work.

How is the virtual memory managed? My first idea is to keep a bitmap
in memory representing the pages in the swap file, so that it will be
fast to test for free pages. When selecting a free page the system
will make sure to select a page "near" the latest accessed one if
possible, in order to enhance cache locality.
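A sketch of the bitmap idea (the sizes are made up, and the real
search strategy is obviously an open question):

#include <stdint.h>
#include <stddef.h>

#define VM_PAGES (1024*1024)            /* example: 1 million pages */

static uint8_t vm_bitmap[VM_PAGES/8];   /* one bit per page, 1 = used */

static int pageIsFree(size_t p) {
    return !(vm_bitmap[p/8] & (1 << (p%8)));
}

/* Look for 'count' contiguous free pages starting near 'hint' (the
 * page of the latest access) to improve locality, then retry from the
 * start of the file. Returns the first page of the run, or -1. */
static long findFreePages(size_t hint, size_t count) {
    for (int pass = 0; pass < 2; pass++) {
        size_t from = (pass == 0) ? hint : 0;
        size_t to   = (pass == 0) ? VM_PAGES : hint;
        size_t run = 0, start = 0;
        for (size_t p = from; p < to; p++) {
            if (!pageIsFree(p)) { run = 0; continue; }
            if (run == 0) start = p;
            if (++run == count) return (long)start;
        }
    }
    return -1;
}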

There is a problem with this otherwise interesting scheme: datasets
with tons of keys holding small values will not work well, because we
have to keep the keys in memory anyway and swapping out the small
values will not save a lot of memory. This can be fixed by using JSON
to represent objects with many fields in a single key or, better, by
introducing the Hash type in Redis, so that I can set all the stuff
about my user ID 1000 in the hash user:1000 containing other values.
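For instance (the command names here are hypothetical, since the type
is only a proposal at this point), instead of user:1000:name,
user:1000:surname and so on as separate keys:

HSET user:1000 name "Mary"
HSET user:1000 surname "Smith"
HGET user:1000 name

All the fields live under a single key, so the whole user can be
swapped in and out as one value.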

I think it's worth implementing this, but as usual I'm very
interested in your thoughts.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"Once you have something that grows faster than education grows,
you’re always going to get a pop culture.", Alan Kay

Andy McCurdy

Nov 3, 2009, 5:00:43 AM
to redi...@googlegroups.com


On Tue, Nov 3, 2009 at 1:04 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

Hello,


There is a problem with this otherwise interesting scheme: datasets
with tons of keys holding small values will not work well, because we
have to keep the keys in memory anyway and swapping out the small
values will not save a lot of memory. This can be fixed by using JSON
to represent objects with many fields in a single key or, better, by
introducing the Hash type in Redis, so that I can set all the stuff
about my user ID 1000 in the hash user:1000 containing other values.


I think a Hash type in Redis is needed for something like this to work well.  The most powerful component of the Hash type will be the ability to update multiple key/values inside the hash atomically -- e.g. MSET for keys within the Hash.  It's impossible to update only a few values within a serialized data structure like JSON, which forces you to break the structure apart into many different keys, each representing a subset of data that gets updated atomically.  
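Something like this, hypothetically (the command name is made up along
the lines of the MSET analogy above):

HMSET user:1000 credits 50 last_login 1257237864

Both fields would change atomically, without rewriting or breaking
apart a serialized blob.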
 
I think it's worth implementing this, but as usual I'm very
interested in your thoughts.


Maybe you're already thinking this, but it would seem to make sense to get 1.1 out the door with ZSETs and the SORT changes.  A Hash type and virtual memory sound like a 1.2 or greater feature :)


Salvatore Sanfilippo

Nov 3, 2009, 5:05:15 AM
to redi...@googlegroups.com
On Tue, Nov 3, 2009 at 11:00 AM, Andy McCurdy <sed...@gmail.com> wrote:

> the Hash.  It's impossible to update only a few values within a serialized
> data structure like JSON, which forces you to break the structure apart into
> many different keys, each representing a subset of data that gets updated
> atomically.

Hello Andy, I agree, and there are other drawbacks as well, such as
object sharing not working.

> Maybe you're already thinking this, but it would seem to make sense to get
> 1.1 out the door with ZSETs and the SORT changes.  A Hash type and virtual
> memory sound like a 1.2 or greater feature :)

Sure, it's not something for 1.1; a lot of rework is needed. Probably
not for 1.2 either, where the Hash type will be introduced. The best
thing to do is probably to make a 1.3 where this is the only change,
because implementing this feature well requires a bit more focus.

Thanks for sharing your ideas!

András Bártházi

Nov 3, 2009, 5:23:33 AM
to redi...@googlegroups.com
Hi,

> I think it's worth implementing this, but as usual I'm very
> interested in your thoughts.

How do you plan on backing up this data? What about using something
not proprietary, but a library to store data on disk (like DBD)?

I have a blog search engine with about 20 GB of blog entry data,
stored in MySQL. Currently I'm saving all data to MySQL, but using
Redis to "cache" the fresh entries with EXPIRE, and also storing the
ids of fresh entries per category (lists are perfect for this). My
strategy is to set EXPIRE to two days, and if I don't find something
in Redis, I load it from MySQL and store it in Redis for two hours.

I'm not sure that your idea about virtual memory will help make this
setup a bit less complicated; I just shared a use case that may give
you some ideas.

Bye,
Andras

Salvatore Sanfilippo

Nov 3, 2009, 5:44:24 AM
to redi...@googlegroups.com
2009/11/3 András Bártházi <barthaz...@gmail.com>:

> I'm not sure that your idea about virtual memory will help make this
> setup a bit less complicated; I just shared a use case that may give
> you some ideas.

Hello Andras,

when using Redis as a cache the virtual memory is not needed at all
(it will just ruin performance). When using it as a database the
virtual memory is completely transparent, as it's just the illusion of
more memory. Redis will continue to save the dataset via snapshotting
or using the new append-only journal, according to the configuration,
even when the virtual memory file is in use.

Actually the VM file will just be created under /tmp, opened by Redis,
and unlink(2)ed, so it will never be visible to the user.
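For reference, that's the classic anonymous temp file idiom (real
syscalls, simplified error handling):

#include <stdlib.h>
#include <unistd.h>

int openVMSwapFile(void) {
    char path[] = "/tmp/redis-vm-XXXXXX";
    int fd = mkstemp(path);   /* create the file with a unique name */
    if (fd == -1) return -1;
    unlink(path);             /* gone from the filesystem, but the fd
                                 stays valid until close(2) */
    return fd;
}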

Cheers,
Salvatore

> Bye,
>  Andras

Guo Du

Nov 3, 2009, 6:14:33 AM
to redi...@googlegroups.com
For datasets bigger than RAM, we may store some of the values on disk
instead of in memory. The following two types of values may take most
of the memory:
1. Rarely accessed values (such as inactive user information)
2. Big values (such as big binary files)

For storing: we keep all keys in memory; if a value matches the
on-disk storage conditions, then we store it on disk and put a pointer
as the value in memory (see the sketch below).
For access: if the value is not in memory, we load it from disk.
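A sketch of the idea (the names are made up): the key always stays in
memory, while its value slot holds either the live object or the
location of the serialized value on disk.

#include <stdint.h>

typedef enum { VALUE_IN_MEMORY, VALUE_ON_DISK } valueLocation;

typedef struct valueRef {
    valueLocation where;
    union {
        void *obj;            /* the in-memory object */
        struct {
            uint32_t page;    /* first page in the swap file */
            uint32_t pages;   /* how many pages the value uses */
        } disk;               /* the "pointer" left behind on swap-out */
    } u;
} valueRef;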

The server could have some smarter tuning algorithm to gain the best
performance, such as by default storing everything in memory. When the
memory limit is reached, swap big/inactive values to disk. This way,
the server would rarely run out of memory.

What do you think?

--Guo

Salvatore Sanfilippo

Nov 3, 2009, 6:26:36 AM
to redi...@googlegroups.com
On Tue, Nov 3, 2009 at 12:14 PM, Guo Du <mrd...@gmail.com> wrote:

> The server could have some smarter tuning algorithm to gain the best
> performance, such as by default storing everything in memory. When
> the memory limit is reached, swap big/inactive values to disk. This
> way, the server would rarely run out of memory.

Hello Guo,

this is how this is going to work indeed. Under a given level of
memory usage, everything is kept in RAM; and the reverse is true as
well: over a given amount of memory used, we'll swap to disk just the
minimum number of items, taken from among the oldest/biggest, to reach
our limit.

So in the long run this is going to be automatically optimized for
best performance.

I did some math and a sensible "page" size for the VM appears to be 256 bytes.
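With 256-byte pages, a swapped value of len serialized bytes would
occupy ceil(len/256) pages, e.g.:

#include <stddef.h>

#define VM_PAGE_SIZE 256

/* Pages needed for a value of 'len' serialized bytes:
 * pagesNeeded(1000) == 4, pagesNeeded(256) == 1. */
size_t pagesNeeded(size_t len) {
    return (len + VM_PAGE_SIZE - 1) / VM_PAGE_SIZE;
}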

Cheers,
Salvatore

Guo Du

Nov 3, 2009, 6:38:51 AM
to redi...@googlegroups.com
On Tue, Nov 3, 2009 at 11:26 AM, Salvatore Sanfilippo <ant...@gmail.com> wrote:
> this is how this is going to work indeed. Under a given level of
> memory usage, everything is kept in RAM; and the reverse is true as
> well: over a given amount of memory used, we'll swap to disk just the
> minimum number of items, taken from among the oldest/biggest, to
> reach our limit.
>
> So in the long run this is going to be automatically optimized for
> best performance.

Exciting to see this happening. It could address most of the
out-of-memory fear without too much performance compromise :)

Thanks for your hard work, guys.

-Guo

pre...@gmail.com

Nov 3, 2009, 6:49:47 AM
to Redis DB
Having paid attention to Redis since 0.900, I spot a pattern showing
that Redis is becoming much more of a DBMS than it was intended to be
in the first place. In no way should anyone consider this a
disadvantage, because that simple evolution process is just
remarkable.

My background is relational databases, with a large focus on the
Oracle engine. Oracle has been around since 1977, so this is a mature
and advanced technology. On the other hand Redis came out of the
trends and requirements of web 2.0 and the development of hosted/cloud
services. Having a large enterprise database on the one hand and a
lightweight key-value datastore on the other, it's very interesting to
see where these two meet. The "virtualmemory" you describe is a
concept from Oracle's SGA, and the sorted data type is something that
has existed in databases for a long time, in the form of a b*tree
index. The similarities do not end here. The difference is that Redis
remains simple, and let it stay that way. Maybe a brief look at
Oracle's architecture is worthwhile for further progress?

Salvatore Sanfilippo

Nov 3, 2009, 6:54:03 AM
to redi...@googlegroups.com
On Tue, Nov 3, 2009 at 12:38 PM, Guo Du <mrd...@gmail.com> wrote:

> Exciting to see this happens. It could address most of the out of
> memory fear without too much performance compromise :)

Assuming dataset accesses are not evenly distributed across all the keyspace ;)

> Thanks for your hard work, guys.

Thanks,

Salvatore Sanfilippo

Nov 3, 2009, 7:05:07 AM
to redi...@googlegroups.com
On Tue, Nov 3, 2009 at 12:49 PM, pre...@gmail.com <pre...@gmail.com> wrote:
>
> Having paid attention to Redis since 0.900, I spot a pattern showing
> that Redis is becoming much more of a DBMS than it was intended to be
> in the first place. In no way should anyone consider this a
> disadvantage, because that simple evolution process is just
> remarkable.

As long as we can follow a path that allows Redis to "scale down" I
guess it's OK. For instance, VM should be a feature not enabled by
default, one that the user can even be unaware of, at least as long as
his memory requirements are low.
This is not always simple or possible, of course.

> engine. Oracle has been around since 1977, so this is a mature and advanced

Well, I was *born* in 1977 ;) it's impressive to think that Oracle was
already there. I thought Oracle was much younger (~ mid 90s).

> technology. On the other hand Redis came out of the trends and
> requirements of web 2.0 and the development of hosted/cloud services.
> Having a large enterprise database on the one hand and a lightweight
> key-value datastore on the other, it's very interesting to see where
> these two meet. The "virtualmemory" you describe is a concept from
> Oracle's SGA, and the sorted data type is something that has existed
> in databases for a long time, in the form of a b*tree index. The
> similarities do not end here. The difference is that Redis remains
> simple, and let it stay that way. Maybe a brief look at Oracle's
> architecture is worthwhile for further progress?

I was shocked the first time I read Andrew Tanenbaum's book on
operating systems by how clever the virtual memory implementations
were in early operating systems (the fact itself that the concept was
there so early is impressive). Similarly, almost everything inside
Redis is based on algorithms that have been around for decades. So it
seems like in computer science the evolution of ideas is not fast at
all; it's just a technological matter, I guess.

So indeed it makes sense to check what was done in the past and try to
get the best ideas. Still, there is at least one field where it is
necessary to look at the future, that is, eventual consistency
algorithms: in this field there is still probably a lot to say...

Ezra Zygmuntowicz

Nov 3, 2009, 10:21:55 AM
to redi...@googlegroups.com

On Nov 3, 2009, at 1:04 AM, Salvatore Sanfilippo wrote:

> in recent weeks I started to think more and more about datasets
> bigger than the available memory. What I think is that many times the
> size of the dataset gets overestimated, and actually even with 8GB of
> RAM it's possible to run many kinds of webapps with tons of users.
> Still it's impossible to deny that there are workloads where a
> key-aging algorithm could work, and these are indeed the workloads
> where having datasets bigger than RAM is useful.

Huge +1 from me. I would love for redis to be able to store more data
than it can fit in physical memory. With this feature I would probably
use redis for all my persistence.


Cheers-
Ezra Zygmuntowicz
e...@engineyard.com

Michel Martens

Nov 3, 2009, 10:28:53 AM
to redi...@googlegroups.com

Another huge +1.

I'm using Redis for all my persistence, so imagine how strongly I
favor this feature.

--
Michel

Laurent Laborde

Nov 3, 2009, 10:35:06 AM
to redi...@googlegroups.com
While I strongly agree with preprio that "Redis is becoming much more
of a DBMS than it was intended to be in the first place"...
... I also totally want this virtual memory feature!

As long as we can also remove/disable some features when needed, to
keep it lightweight.
*grins*

--
F4FQM
Kerunix Flan
Laurent Laborde

Grégoire Welraeds

Nov 3, 2009, 11:22:19 AM
to redi...@googlegroups.com
Once again, I have to agree with Laurent :-). For my redis based project (approximate name searching), I need everything in memory to keep response time reasonable. My dataset only changes when putting new names in memory. If I don't have sufficient memory, I prefer to get an error from Redis rather than have behind-the-scenes swapping while I'm searching names. So, please, keep it as an option.

thanks.
Greg.

Salvatore Sanfilippo

Nov 3, 2009, 11:31:18 AM
to redi...@googlegroups.com
2009/11/3 Grégoire Welraeds <gregoire...@gmail.com>:

> Once again, I have to agree with Laurent :-). For my redis based
> project (approximate name searching), I need everything in memory to
> keep response time reasonable. My dataset only changes when putting
> new names in memory. If I don't have sufficient memory, I prefer to
> get an error from Redis rather than have behind-the-scenes swapping
> while I'm searching names. So, please, keep it as an option.

Hello Laurent, Grégoire,

sure it will be an option disabled by default.

Sergey Shepelev

Nov 4, 2009, 8:37:09 AM
to redi...@googlegroups.com
Another huge +1 from me.

I wonder, shouldn't this be a built-in feature of modern operating
systems, like the ability for an application to say to the OS, "swap
this memory area as soon as you can"? Or as soon as my (application)
memory usage grows beyond N pages or something like that. Maybe
linux/freebsd/darwin have it?

Salvatore Sanfilippo

Nov 4, 2009, 9:24:34 AM
to redi...@googlegroups.com
On Wed, Nov 4, 2009 at 2:37 PM, Sergey Shepelev <tem...@gmail.com> wrote:

> Another huge +1 from me.
>
> I wonder, shouldn't this be a built-in feature of modern operating
> systems, like the ability for an application to say to the OS, "swap
> this memory area as soon as you can"? Or as soon as my (application)
> memory usage grows beyond N pages or something like that. Maybe
> linux/freebsd/darwin have it?

Hello Sergey,

I bet this is going to be a FAQ, because in theory it actually should
work that way. So I'm writing the answer here in the form of a FAQ, to
cut&paste into the Wiki page.

=Do you plan to implement Virtual Memory in Redis? Why don't just let
the Operating System handle it with?=

Yes, in order to support datasets bigger than RAM there is a plan to
implement transparent Virtual Memory in Redis, that is, the ability to
transfer large values associated with rarely used keys to disk, and to
reload them transparently in memory when these values are requested in
some way.

So you may ask why not let the operating system VM do the work for us.
There are two main reasons. First, in Redis even a large value stored
at a given key, for instance a 1-million-element list, is not
allocated in a contiguous piece of memory. It's actually *very*
fragmented, since Redis uses quite aggressive object sharing and
reuses allocated Redis Object structures.

So you can imagine the memory layout composed of 4096-byte pages that
actually contain different parts of different large values. Not only
that, but a lot of values that are large enough for us to swap out to
disk, like a 1024-byte value, are just one quarter the size of a
memory page, and likely in the same page there are other values that
are not rarely used. So such a value will never be swapped out by the
operating system. This is the first reason for implementing
application-level virtual memory in Redis.

There is another reason, as important as the first. A complex object
in memory, like a list or a set, is something *10 times bigger* than
the same object serialized on disk. You probably already noticed how
Redis snapshots on disk are damn small compared to the memory usage of
Redis for the same objects. This happens because data in memory is
full of pointers, reference counters and other metadata. Add to this
malloc fragmentation and the need to return word-aligned chunks of
memory, and you have a clear picture of what happens. So this means
having 10 times the I/O between memory and disk than otherwise needed.
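A back-of-the-envelope illustration of the blow-up (all sizes here are
assumptions for a 64-bit machine, not measured Redis numbers):

#include <stdio.h>

/* What a single short list element might cost in memory. */
struct nodeSketch { struct nodeSketch *prev, *next; void *value; };
struct objSketch  { int type; int refcount; void *ptr; };

int main(void) {
    size_t serialized = 6;                      /* length byte + "hello" on disk */
    size_t inMemory = sizeof(struct nodeSketch) /* 24: linked list node */
                    + sizeof(struct objSketch)  /* 16: object header */
                    + 6                         /* the string bytes */
                    + 16;                       /* assumed malloc overhead */
    printf("~%zu bytes in RAM vs %zu serialized\n", inMemory, serialized);
    return 0;
}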

Hope this helps,

Salvatore Sanfilippo

Nov 4, 2009, 9:25:20 AM
to redi...@googlegroups.com
On Wed, Nov 4, 2009 at 3:24 PM, Salvatore Sanfilippo <ant...@gmail.com> wrote:

> =Do you plan to implement Virtual Memory in Redis? Why don't just let
> the Operating System handle it with?=

Sorry that's "handle it for you" of course.

Sergey Shepelev

Nov 4, 2009, 10:45:35 AM
to redi...@googlegroups.com
Yeah, thanks for the detailed explanation.

For the first reason, about sparse data, I think some kind of
compaction would help. Some kind of compaction would help performance
even without considering VM anyway.
But for the second, about metadata, I guess there is no really good
fix.

Chris M. Welsh

Nov 9, 2009, 3:43:49 PM
to Redis DB
Hello,

I am a big fan of Redis, simply because not many other "key/value"
databases have support for the data structures found in redis.

With that said, is it possible this is going beyond the scope of the
project? For example, the SUNION and SINTER commands are great because
they work on the data in memory, which makes them very fast. If the
data were residing in a mix of memory/disk storage, this could slow
things down greatly, no?

I would suggest making a companion program to run alongside redis to
handle the paging out of unneeded data. In my use case, I know there
is data that will usually be used for a day at most, and then needs to
be stored permanently with a much smaller chance of access. There's
other data that I want to keep in memory constantly for fast
operations. The secondary program could handle the paging of the data
and keep redis a pure, fast memory store.

Sincerely,
Chris
> Salvatore 'antirez' Sanfilippo http://invece.org

Guo Du

Nov 9, 2009, 4:46:10 PM
to redi...@googlegroups.com
On Mon, Nov 9, 2009 at 8:43 PM, Chris M. Welsh <stop...@gmail.com> wrote:
> data were residing in a mix of memory/disk storage, this could slow
> things down greatly, no?

Anything on disk will be much slower. But as discussed, only selected
*inactive/big* items would be swapped to disk, so it won't affect the
*hot* high-performance items too much. This way, redis can make use of
the available memory while handling a much larger data set than would
fit in physical memory.

-Guo

Thibaut

Nov 23, 2009, 2:35:36 PM
to Redis DB
+1 from me.

This would definitely be very useful for larger values. Using an SSD
disk as storage would certainly speed up the retrieval process as
well.

Ian

Nov 23, 2009, 3:11:42 PM
to redi...@googlegroups.com
+1

wqhhust

Nov 24, 2009, 8:22:52 AM
to Redis DB
+1, it could replace mysql in some use cases