In recent weeks I started to think more and more about datasets
bigger than the available memory. What I think is that many times the
size of the dataset gets overestimated, and actually even with 8GB of
RAM it's possible to run many kinds of webapps with tons of users.
Still it's impossible to deny that there are workloads where a
key-aging algorithm could work, and these are indeed the workloads
where having datasets bigger than RAM is useful.
For instance, imagine a Redis dataset composed of 10 million keys
containing mostly lists, with every list averaging 1000 elements.
This is a lot of data, but if the access patterns are not evenly
distributed, and actually most of the time just 10% of the data is
accessed (your most active users, for instance), a dataset bigger than
RAM makes sense because it's possible to keep in RAM only the
information that is most likely to be accessed. Ok, it's just the same
concept as the Virtual Memory of operating systems, but here the
granularity is not the hardware memory page, it's the key itself.
Basically Redis could have a new config directive like this:
virtualmemory <bytes>
If this is set, Redis will start using the unused bytes of the Redis
object structure in order to store aging information. Every time a key
is accessed we set this truncated timestamp. Aging will work even with
a three-byte timestamp (in seconds this is equivalent to 194 days:
(2^24)/(3600*24) = 194); basically it starts to be less accurate about
which keys to swap when there are *many* keys not read or written for
more than 194 days, but there is a simple algorithm to fix this as
well (when testing for keys to swap out, when we find keys that are
near overflow just fix this by adding for instance 60 days; the key is
old anyway and will likely be swapped).
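The truncated-timestamp idea can be sketched like this (an illustration only, not Redis code; the names are mine, and here the wraparound is handled with modular arithmetic, a variant of the "add 60 days" fix described above):

```python
import time

SPAN = 1 << 24   # a 3-byte timestamp wraps every 2^24 seconds, i.e. ~194 days

def truncated_now(now=None):
    """Truncate a Unix timestamp to 24 bits, as it could be stored in
    the unused bytes of the Redis object structure (illustrative only)."""
    if now is None:
        now = int(time.time())
    return now & (SPAN - 1)

def key_age(stored, now=None):
    """Seconds elapsed since the 24-bit timestamp `stored`,
    handling wraparound with modular arithmetic."""
    return (truncated_now(now) - stored) % SPAN

# Sanity check of the math in the post: (2^24)/(3600*24) = 194 days
assert SPAN // (3600 * 24) == 194
```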
Ok, now that there is aging information, every time Redis starts to
run low on memory a few keys will be flushed to disk. Actually the key
will remain in memory, it's the *value* that gets swapped out. How are
these keys selected? By aging and size. A few will be tested, and the
one with the best combination of age and size will be swapped out (not
sure about the best algorithm but I guess that Swappability = Size*Age
is a decent start).
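A minimal sketch of the idea (hypothetical names; `meta` stands in for the per-key size and age Redis would track):

```python
import random

def pick_victim(keys, meta, sample_size=5):
    """Test a few random keys and return the one with the highest
    Swappability = Size * Age; `meta` maps key -> (size, age)."""
    sample = random.sample(keys, min(sample_size, len(keys)))
    return max(sample, key=lambda k: meta[k][0] * meta[k][1])
```

Sampling a handful of keys instead of scanning them all keeps the cost of each swap decision constant regardless of dataset size.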
When the user tries to access the key, its value will get reloaded
from disk into memory. We likely already have methods Redis calls
every time a key is accessed, so few changes are required to make this
work.
How is the virtual memory managed? My first idea is to keep a bitmap
in memory representing the pages in the swap file, so that it is fast
to test for free pages. When selecting a free page the system will
make sure to select a page "near" the latest accessed one if possible,
in order to enhance cache locality.
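The bitmap idea could look like this (a sketch with hypothetical names; a real implementation would pack one bit per page, here it's one byte per page for clarity):

```python
class SwapBitmap:
    """Tracks free pages in the swap file; 0 means free, 1 means used."""

    def __init__(self, num_pages):
        self.bits = bytearray(num_pages)   # one byte per page for clarity
        self.num_pages = num_pages

    def alloc_near(self, hint):
        """Find a free page, scanning outward from `hint`,
        to keep swapped values close to the latest accessed page."""
        for dist in range(self.num_pages):
            for page in (hint + dist, hint - dist):
                if 0 <= page < self.num_pages and not self.bits[page]:
                    self.bits[page] = 1
                    return page
        return None   # swap file is full

    def free(self, page):
        self.bits[page] = 0
```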
There is a problem with this otherwise interesting scheme: datasets
with tons of keys with small values will not work well, because we
still have to keep the keys in memory, and a small value will not save
a lot of memory when swapped out. This can be fixed by using JSON to
represent objects with many fields in a single key, or better, by
introducing the Hash type in Redis, so that I can set all the stuff
about my user ID 1000 in the hash user:1000 containing the other
values.
I think it's worth implementing this, but as usual I'm very
interested in your thoughts.
Cheers,
Salvatore
--
Salvatore 'antirez' Sanfilippo
http://invece.org
"Once you have something that grows faster than education grows,
you’re always going to get a pop culture.", Alan Kay
> the Hash. It's impossible to update only a few values within a serialized
> data structure like JSON, which forces you to break the structure apart into
> many different keys, each representing a subset of data that gets updated
> atomically.
Hello Andy, I agree, and there are other drawbacks as well, such as
object sharing not working.
> Maybe you're already thinking this, but it would seem to make sense to get
> 1.1 out the door with ZSETs and the SORT changes. A Hash type and virtual
> memory sound like a 1.2 or greater feature :)
Sure, it's not something for 1.1, a lot of rework is needed. Probably
not for 1.2 either, where the Hash type will be introduced. The best
thing to do is probably to make a 1.3 where the only change is this
one, because implementing this feature well requires a bit more focus.
Thanks for sharing your ideas!
> I think it's worth to implement this, but as usually I'm very
> interested in your thoughts.
How do you plan to back up this data? What about using something not
proprietary, but a library to store data on disk (like DBD)?
I have a blog search engine with about 20 GB of blog entry data,
stored in MySQL. Currently I'm saving all data to MySQL, but using
Redis to "cache" the fresh entries with EXPIRE, and also storing the
ids of fresh entries per category (lists are perfect for this). My
strategy is that I'm setting EXPIRE to two days, and if I don't find
something in Redis, I load it from MySQL and store it in Redis for two
hours.
I'm not sure that your idea about virtual memory helps make this
setup a bit less complicated, I just shared a use case that may give
you some ideas.
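This setup is the classic cache-aside pattern with a TTL; a sketch of the flow (the class and function names are mine, and the tiny in-memory stand-ins for Redis and MySQL are illustrative, not real client calls):

```python
import time

class TinyCache:
    """In-memory stand-in for Redis with EXPIRE/SETEX-like semantics,
    just to illustrate the cache-aside flow described above."""

    def __init__(self):
        self.data = {}

    def setex(self, key, ttl, value, now=None):
        now = time.time() if now is None else now
        self.data[key] = (value, now + ttl)   # value plus expiry time

    def get(self, key, now=None):
        now = time.time() if now is None else now
        hit = self.data.get(key)
        if hit is None or hit[1] <= now:
            return None                        # missing or expired
        return hit[0]

def fetch_entry(cache, db, entry_id, now=None):
    """Cache-aside: try the cache first, fall back to the database,
    then cache the result for two hours."""
    value = cache.get(entry_id, now=now)
    if value is None:
        value = db[entry_id]                   # stand-in for the MySQL query
        cache.setex(entry_id, 2 * 3600, value, now=now)
    return value
```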
Bye,
Andras
For stores, we keep all keys in memory; if a value matches the on-disk
storage conditions, then we store it on disk and put a pointer to it
as the value in memory.
For accesses, if the value is not in memory, we load it from disk.
The server could have some smarter tuning algorithm to get the best
performance, such as storing all the stuff in memory by default. When
it reaches the memory limit, it swaps big/inactive values to disk.
This way, the server would rarely run out of memory.
What do you think?
--Guo
> The server could have some smarter tuning algorithm to get the best
> performance, such as storing all the stuff in memory by default. When
> it reaches the memory limit, it swaps big/inactive values to disk.
> This way, the server would rarely run out of memory.
Hello Guo,
this is how it is going to work indeed. Below a given level of
memory usage, everything is kept in RAM; and the reverse is also true:
above a given amount of memory used we'll swap to disk just the
minimum number of items, taken among the oldest/biggest, to get back
within our limit.
So in the long run this is going to be automatically optimized for
best performance.
I did some math and a sensible "page" size for the VM appears to be 256 bytes.
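A quick back-of-the-envelope check of what a 256-byte page implies for the free-page bitmap (the 1 GB swap file size here is just an illustrative figure, not from the thread):

```python
# With 256-byte VM pages, the bitmap tracking free pages stays tiny:
# a 1 GB swap file needs only 512 KB of RAM for its bitmap.
PAGE_SIZE = 256
swap_file = 1 << 30                 # 1 GB swap file (illustrative size)
pages = swap_file // PAGE_SIZE      # number of pages in the swap file
bitmap_bytes = pages // 8           # one bit per page

assert pages == 4_194_304
assert bitmap_bytes == 512 * 1024   # 512 KB
```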
Cheers,
Salvatore
Exciting to see this happening. It could address most of the
out-of-memory fear without too much performance compromise :)
Thanks for your hard work, guys.
-Guo
> Exciting to see this happening. It could address most of the
> out-of-memory fear without too much performance compromise :)
Assuming dataset accesses are not evenly distributed across all the keyspace ;)
> Thanks for your hard work, guys.
Thanks,
> In recent weeks I started to think more and more about datasets
> bigger than the available memory. What I think is that many times the
> size of the dataset gets overestimated, and actually even with 8GB of
> RAM it's possible to run many kinds of webapps with tons of users.
> Still it's impossible to deny that there are workloads where a
> key-aging algorithm could work, and these are indeed the workloads
> where having datasets bigger than RAM is useful.
Huge +1 from me. I would love for redis to be able to store more data
than it can fit in physical memory. With this feature I would
probably use redis for all my persistence.
Cheers-
Ezra Zygmuntowicz
e...@engineyard.com
Another huge +1.
I'm using Redis for all my persistence, so imagine how strongly I
favor this feature.
--
Michel
As long as we can also remove/disable some features when needed, to
keep it lightweight.
*grins*
--
F4FQM
Kerunix Flan
Laurent Laborde
Hello Laurent, Grégoire,
sure it will be an option disabled by default.
> Another huge +1 from me.
>
> I wonder, shouldn't it be a builtin feature of modern operating
> systems, like the ability for an application to tell the OS, "swap
> this memory area as soon as you can"? Or as soon as my (application)
> memory usage grows beyond N pages or something like that. Maybe
> linux/freebsd/darwin have it?
Hello Sergey,
I bet this is going to be a FAQ because in theory it actually should
work that way. So I'll write it here in the form of a FAQ, to
cut & paste into the Wiki page.
=Do you plan to implement Virtual Memory in Redis? Why don't just let
the Operating System handle it with?=
Yes, in order to support datasets bigger than RAM there is a plan to
implement transparent Virtual Memory in Redis, that is, the ability to
transfer large values associated with rarely used keys to disk, and
reload them transparently into memory when these values are requested
in some way.
So you may ask why not let the operating system's VM do the work for
us. There are two main reasons: in Redis even a large value stored at
a given key, for instance a 1 million element list, is not allocated
in a contiguous piece of memory. It's actually *very* fragmented,
since Redis uses quite aggressive object sharing and reuses allocated
Redis Object structures.
So you can imagine the memory layout composed of 4096-byte pages that
actually contain different parts of different large values. Not only
that, but many values that are large enough for us to swap out to
disk, like a 1024-byte value, are just one quarter the size of a
memory page, and likely the same page holds other values that are not
rarely used. So this value will never be swapped out by the operating
system. This is the first reason for implementing application-level
virtual memory in Redis.
There is another one, as important as the first. A complex object in
memory like a list or a set is something *10 times bigger* than the
same object serialized on disk. Probably you already noticed how Redis
snapshots on disk are so much smaller compared to the memory usage of
Redis for the same objects. This happens because data in memory is
full of pointers, reference counters and other metadata. Add to this
malloc fragmentation and the need to return word-aligned chunks of
memory and you have a clear picture of what happens. So letting the OS
do it would mean 10 times the I/O between memory and disk than
otherwise needed.
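The same effect shows up in any pointer-heavy runtime; this Python snippet (illustrative only, these are not Redis data structures, and the exact ratio varies) compares a list's in-memory footprint with its serialized size:

```python
import pickle
import sys

# A list of 1000 small strings: in memory each element carries an object
# header, a reference counter, and a pointer slot; serialized, mostly the
# payload bytes remain.
values = [str(i) for i in range(1000)]
in_memory = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
serialized = len(pickle.dumps(values))

# The in-memory form is several times larger (the post quotes ~10x for Redis).
assert in_memory > 3 * serialized
```

Note that `sys.getsizeof` is a shallow measure, so if anything this undercounts the in-memory overhead.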
Hope this helps,
> =Do you plan to implement Virtual Memory in Redis? Why don't just let
> the Operating System handle it with?=
Sorry that's "handle it for you" of course.
Anything on disk will be much slower. But as discussed, only selected
*inactive/big* items would be swapped to disk, so it won't affect the
*hot* high-performance items too much. This way, redis could make use
of the available memory while handling a much larger data set than
physical memory.
-Guo