a few months after VM started to work my feeling about it started to be not very good. I stated in the blog, and privately on IRC, especially talking with Pieter, that VM was not the way to go for the future of Redis, and that the new path we were taking about using less memory was a much better approach. Together with cluster.
However there are a number of different models for dealing with datasets bigger than RAM for Redis. Just to cite a few:
1) virtual memory, where we swap values on disk as needed (The Redis Virtual Memory way) 2) storing data on disk, in a complex form so that operations can be implemented directly in the on-disk representation, and using the OS cache as a cache layer for the working set (let's call it the Mongo DB way) 3) storing data on disk, but not for direct manipulation, and use memory as a cache of objects that are active, flushing writes on disks when this objects change.
It is now clear that VM is not the right set of tradeoffs, it was designed to be pretty fast but on the other hand there was a too big price to pay for all the rest: slow restarts, slow saving, and in turn slow replication, very complex code, and so forth.
If you want pure speed with Redis, in memory is the way to go. So as a reaction to the email sent by Tim about his unhappiness with VM I used a few vacation days to start implementing a new model, that is was was listed above as number "3".
The new set of tradeoffs are very different. The result is called diskstore, and this is how it works, in a few easy to digest points.
- In diskstore key-value paris are stored on disk. - Memory works as a cache for live objects. Operations are only performed on in memory keys, so data on disk does not need to be stored in complex forms. - The cache-max-memory limit is strict. Redis will never use more RAM, even if we have 2 MB of max memory and 1 billion of keys. This works since now we don't need to take keys in memory. - Data is flushed on disk asynchronously. If a key is marked as dirty, and IO operation is scheduled for this key. - You can control the delay between modifications of keys and disk writes, so that if a key is modified a lot of time in small time, it will written only one time on disk. - Setting the delay to 0 means, sync it as fast as possible. - All I/O is performed by a single dedicated thread, that is long-running and not spawned on demand. The thread is awaked with a conditional variable. - The system is much simpler and sane than VM implementation, as there is no need to "undo" operations on race conditions. - Zero start-up time... as objects are loaded on demand. - There is negative caching. If a key is not on disk we remember it (if there is memory to do so). So we avoid accessing the disk again and again for keys that are not there. - The system is very fast if we access mostly our working set, and this working set happens to fit memory. Otherwise the system is much slower (I/O bound). - The system does not support BGSAVE currently, but will support this, and what is cool, with minimal overhead and used memory in the saving child, as data on disk is already written using the same serialization format as .rdb files. So our child will just copy files to obtain the .rdb. In the mean time the objects in cache are not flushed, so the system may use more memory, but it's not about copy-on-write, so it will use very very little additional memory. - Persistence is *PER KEY* this means, there is no point in time persistence.
I think that the above points may give you an idea about how it works. But let me stress the per-key persistence point a bit.
LPUSH a 0 LPUSH b 1 LPUHS a 2
So after this commands we may have two IO scheduled operations pending. One for "a" and one for "b". Now imagine "a" is saved, and then the server goes down, Redis is brutally killed, or alike. The database will contain a consistent version of "a" and "b", but the version of "b" will be old, without the "1" pushed.
Also currently MULTI/EXEC is not transactional, but this will be fixed, at least inside a multi/exec there will be guarantee that either all values or nothing will be synched to disk (this will be obtained using a journal file for transactions).
Some more details. The system is composed of two layers:
diskstore.c -- implements a trivial on disk key-value store dscache.c -- implements the more complex caching layer
diskstore.c is currently a filesystem based KV store. This can be replaced with a B-TREE or something like that in the future if this will be needed. However even if the current implementation has a big overhead, it's pretty cool to have data as files, with very little chances of loosing data and corruption (rename is used for writes). But well if this does not scale well enough we'll drop it and replace it with something better.
The current implementations is similar to bigdis. 256 directories containing 256 directories each are used, for a total of 65536 dir. Every key is put inside the dirs addressed by SHA1(key) translated in hex, for instance key "foo" is at:
/0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
The cool thing is, diskstore.c exports a trivial interface to Redis, so it's very simple to replace with something else without touching too much internals.
Stability: the system is obviously in alpha stage, however it works pretty well, without obvious crashes. But warning, it will crash with an assert if you try to BGSAVE.
To try it download the "unstable" branch, edit redis.conf to enable diskstore. Play with it. Enjoy a redis instance that starts in no time even when it's full of data :)
Feedbacks are really really appreciated here. I want to know what you think, what are your impressions on the design, tradeoffs, and so forth, how it feels when you experiment with it. If you want to see the inner workings set log level to "debug".
The goal is to ship 2.4 ASAP with VM replaced with a good implementation of diskstore.
I've definitely had issues with VM, and am excited to see how diskstore works out. In theory, the tradeoffs seem much better. Couple questions.
What happens if I have a producer (or perhaps many producers) of data that add dirty data to Redis faster than the Diskstore IO thread can write it? Specifically, what if the dirty data approaches the max memory limits? Will the memory footprint grow temporarily until those values are persisted, then free itself?
What about if I have a large data structure? One with thousands of items in a list or zset. If this key is written to frequently (multiple times a second), will the IO thread get backed up by having to write such a large structure over and over again?
-andy
On Tue, Jan 4, 2011 at 11:57 AM, Salvatore Sanfilippo <anti...@gmail.com>wrote:
> a few months after VM started to work my feeling about it started to > be not very good. I stated in the blog, and privately on IRC, > especially talking with Pieter, that VM was not the way to go for the > future of Redis, and that the new path we were taking about using less > memory was a much better approach. Together with cluster.
> However there are a number of different models for dealing with > datasets bigger than RAM for Redis. Just to cite a few:
> 1) virtual memory, where we swap values on disk as needed (The Redis > Virtual Memory way) > 2) storing data on disk, in a complex form so that operations can be > implemented directly in the on-disk representation, and using the OS > cache as a cache layer for the working set (let's call it the Mongo DB > way) > 3) storing data on disk, but not for direct manipulation, and use > memory as a cache of objects that are active, flushing writes on disks > when this objects change.
> It is now clear that VM is not the right set of tradeoffs, it was > designed to be pretty fast but on the other hand there was a too big > price to pay for all the rest: slow restarts, slow saving, and in turn > slow replication, very complex code, and so forth.
> If you want pure speed with Redis, in memory is the way to go. So as a > reaction to the email sent by Tim about his unhappiness with VM I used > a few vacation days to start implementing a new model, that is was was > listed above as number "3".
> The new set of tradeoffs are very different. The result is called > diskstore, and this is how it works, in a few easy to digest points.
> - In diskstore key-value paris are stored on disk. > - Memory works as a cache for live objects. Operations are only > performed on in memory keys, so data on disk does not need to be > stored in complex forms. > - The cache-max-memory limit is strict. Redis will never use more RAM, > even if we have 2 MB of max memory and 1 billion of keys. This works > since now we don't need to take keys in memory. > - Data is flushed on disk asynchronously. If a key is marked as dirty, > and IO operation is scheduled for this key. > - You can control the delay between modifications of keys and disk > writes, so that if a key is modified a lot of time in small time, it > will written only one time on disk. > - Setting the delay to 0 means, sync it as fast as possible. > - All I/O is performed by a single dedicated thread, that is > long-running and not spawned on demand. The thread is awaked with a > conditional variable. > - The system is much simpler and sane than VM implementation, as there > is no need to "undo" operations on race conditions. > - Zero start-up time... as objects are loaded on demand. > - There is negative caching. If a key is not on disk we remember it > (if there is memory to do so). So we avoid accessing the disk again > and again for keys that are not there. > - The system is very fast if we access mostly our working set, and > this working set happens to fit memory. Otherwise the system is much > slower (I/O bound). > - The system does not support BGSAVE currently, but will support this, > and what is cool, with minimal overhead and used memory in the saving > child, as data on disk is already written using the same serialization > format as .rdb files. So our child will just copy files to obtain the > .rdb. In the mean time the objects in cache are not flushed, so the > system may use more memory, but it's not about copy-on-write, so it > will use very very little additional memory. > - Persistence is *PER KEY* this means, there is no point in time > persistence.
> I think that the above points may give you an idea about how it works. > But let me stress the per-key persistence point a bit.
> LPUSH a 0 > LPUSH b 1 > LPUHS a 2
> So after this commands we may have two IO scheduled operations > pending. One for "a" and one for "b". > Now imagine "a" is saved, and then the server goes down, Redis is > brutally killed, or alike. The database will contain a consistent > version of "a" and "b", but the version of "b" will be old, without > the "1" pushed.
> Also currently MULTI/EXEC is not transactional, but this will be > fixed, at least inside a multi/exec there will be guarantee that > either all values or nothing will be synched to disk (this will be > obtained using a journal file for transactions).
> Some more details. The system is composed of two layers:
> diskstore.c -- implements a trivial on disk key-value store > dscache.c -- implements the more complex caching layer
> diskstore.c is currently a filesystem based KV store. This can be > replaced with a B-TREE or something like that in the future if this > will be needed. However even if the current implementation has a big > overhead, it's pretty cool to have data as files, with very little > chances of loosing data and corruption (rename is used for writes). > But well if this does not scale well enough we'll drop it and replace > it with something better.
> The current implementations is similar to bigdis. 256 directories > containing 256 directories each are used, for a total of 65536 dir. > Every key is put inside the dirs addressed by SHA1(key) translated in > hex, for instance key "foo" is at:
> /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
> The cool thing is, diskstore.c exports a trivial interface to Redis, > so it's very simple to replace with something else without touching > too much internals.
> Stability: the system is obviously in alpha stage, however it works > pretty well, without obvious crashes. But warning, it will crash with > an assert if you try to BGSAVE.
> To try it download the "unstable" branch, edit redis.conf to enable > diskstore. Play with it. Enjoy a redis instance that starts in no time > even when it's full of data :)
> Feedbacks are really really appreciated here. I want to know what you > think, what are your impressions on the design, tradeoffs, and so > forth, how it feels when you experiment with it. If you want to see > the inner workings set log level to "debug".
> The goal is to ship 2.4 ASAP with VM replaced with a good > implementation of diskstore.
> "We are what we repeatedly do. Excellence, therefore, is not an act, > but a habit." -- Aristotele
> -- > You received this message because you are subscribed to the Google Groups > "Redis DB" group. > To post to this group, send email to redis-db@googlegroups.com. > To unsubscribe from this group, send email to > redis-db+unsubscribe@googlegroups.com<redis-db%2Bunsubscribe@googlegroups.c om> > . > For more options, visit this group at > http://groups.google.com/group/redis-db?hl=en.
To share some quick, high level thoughts on my initial tests and what
I've seen so far...
I've been starting off with a basic test that essentially spams simple
SET requests to a redis instance with small (1-30 bytes or so) keys
and values as fast as it can go. I've been using a 2-cpu (for a total
of 16 cores) Xeon X5570 with 24GB of RAM and, at the moment, a solid
state drive that I'm using for long term storage.
* Once redis reaches the cache RAM limit, the machine starts thrashing
pretty bad, and ends up spending a good deal of time waiting for disk
i/o and generally slowing redis to a halt. The SETs per second at
that point quickly drops of from about 25-26k to 10k, and then to
nothing as it thrashes back and forth.
* I haven't spent as much time testing this out, but increasing cache-
flush-delay seems to decrease the "hit" that occurs once we reach the
RAM limit, but that success is short lived :)
I will freely admit that my test is a little unfair at best, and
unrealistic at worst -- obviously in a production scenario I wouldn't
be using redis with a disk backing store simply to constantly issue
SET requests. However, I've been checking out various options to do
more or less exactly what the future redis v2.4 aims to do, and "what
happens when you run out of RAM" is a question that I need to answer,
and the answer, in most cases, seems to be "noooooo!!!!!!!" :). Not
entirely surprising given the stress I'm putting on it, but I need to
know what happens in worst case scenarios.
That being said, while having something that magically experienced
zero performance degradation when writing to disk instead of RAM would
be just lovely, I'm not gonna hold my breath :). I am looking for
consistent and predictable performance characteristics, however. At
the moment, redis effectively locks up if it's under heavy load, has
reached the cache size limit, and the db is continuing to grow.
I'll be continuing some of my testing in the near future to include
some scenarios that are a bit more realistic, mixing GET/SET/etc.
requests in certain ratios and so forth, using things that are at
least closer to "real world" data (for us, anyway).
Depending on how far I get, I might offer up a branch with a few
tweaks of my own, but that depends largely on how much time I can
devote to it in the near future :).
To close, I'll just say that for being as new and early stage as the
branch formerly known as 'diskstore' is, it's looking pretty good.
Thanks for all the hard work!
-Chris
On Jan 4, 2:57 pm, Salvatore Sanfilippo <anti...@gmail.com> wrote:
> a few months after VM started to work my feeling about it started to
> be not very good. I stated in the blog, and privately on IRC,
> especially talking with Pieter, that VM was not the way to go for the
> future of Redis, and that the new path we were taking about using less
> memory was a much better approach. Together with cluster.
> However there are a number of different models for dealing with
> datasets bigger than RAM for Redis. Just to cite a few:
> 1) virtual memory, where we swap values on disk as needed (The Redis
> Virtual Memory way)
> 2) storing data on disk, in a complex form so that operations can be
> implemented directly in the on-disk representation, and using the OS
> cache as a cache layer for the working set (let's call it the Mongo DB
> way)
> 3) storing data on disk, but not for direct manipulation, and use
> memory as a cache of objects that are active, flushing writes on disks
> when this objects change.
> It is now clear that VM is not the right set of tradeoffs, it was
> designed to be pretty fast but on the other hand there was a too big
> price to pay for all the rest: slow restarts, slow saving, and in turn
> slow replication, very complex code, and so forth.
> If you want pure speed with Redis, in memory is the way to go. So as a
> reaction to the email sent by Tim about his unhappiness with VM I used
> a few vacation days to start implementing a new model, that is was was
> listed above as number "3".
> The new set of tradeoffs are very different. The result is called
> diskstore, and this is how it works, in a few easy to digest points.
> - In diskstore key-value paris are stored on disk.
> - Memory works as a cache for live objects. Operations are only
> performed on in memory keys, so data on disk does not need to be
> stored in complex forms.
> - The cache-max-memory limit is strict. Redis will never use more RAM,
> even if we have 2 MB of max memory and 1 billion of keys. This works
> since now we don't need to take keys in memory.
> - Data is flushed on disk asynchronously. If a key is marked as dirty,
> and IO operation is scheduled for this key.
> - You can control the delay between modifications of keys and disk
> writes, so that if a key is modified a lot of time in small time, it
> will written only one time on disk.
> - Setting the delay to 0 means, sync it as fast as possible.
> - All I/O is performed by a single dedicated thread, that is
> long-running and not spawned on demand. The thread is awaked with a
> conditional variable.
> - The system is much simpler and sane than VM implementation, as there
> is no need to "undo" operations on race conditions.
> - Zero start-up time... as objects are loaded on demand.
> - There is negative caching. If a key is not on disk we remember it
> (if there is memory to do so). So we avoid accessing the disk again
> and again for keys that are not there.
> - The system is very fast if we access mostly our working set, and
> this working set happens to fit memory. Otherwise the system is much
> slower (I/O bound).
> - The system does not support BGSAVE currently, but will support this,
> and what is cool, with minimal overhead and used memory in the saving
> child, as data on disk is already written using the same serialization
> format as .rdb files. So our child will just copy files to obtain the
> .rdb. In the mean time the objects in cache are not flushed, so the
> system may use more memory, but it's not about copy-on-write, so it
> will use very very little additional memory.
> - Persistence is *PER KEY* this means, there is no point in time persistence.
> I think that the above points may give you an idea about how it works.
> But let me stress the per-key persistence point a bit.
> LPUSH a 0
> LPUSH b 1
> LPUHS a 2
> So after this commands we may have two IO scheduled operations
> pending. One for "a" and one for "b".
> Now imagine "a" is saved, and then the server goes down, Redis is
> brutally killed, or alike. The database will contain a consistent
> version of "a" and "b", but the version of "b" will be old, without
> the "1" pushed.
> Also currently MULTI/EXEC is not transactional, but this will be
> fixed, at least inside a multi/exec there will be guarantee that
> either all values or nothing will be synched to disk (this will be
> obtained using a journal file for transactions).
> Some more details. The system is composed of two layers:
> diskstore.c -- implements a trivial on disk key-value store
> dscache.c -- implements the more complex caching layer
> diskstore.c is currently a filesystem based KV store. This can be
> replaced with a B-TREE or something like that in the future if this
> will be needed. However even if the current implementation has a big
> overhead, it's pretty cool to have data as files, with very little
> chances of loosing data and corruption (rename is used for writes).
> But well if this does not scale well enough we'll drop it and replace
> it with something better.
> The current implementations is similar to bigdis. 256 directories
> containing 256 directories each are used, for a total of 65536 dir.
> Every key is put inside the dirs addressed by SHA1(key) translated in
> hex, for instance key "foo" is at:
> /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
> The cool thing is, diskstore.c exports a trivial interface to Redis,
> so it's very simple to replace with something else without touching
> too much internals.
> Stability: the system is obviously in alpha stage, however it works
> pretty well, without obvious crashes. But warning, it will crash with
> an assert if you try to BGSAVE.
> To try it download the "unstable" branch, edit redis.conf to enable
> diskstore. Play with it. Enjoy a redis instance that starts in no time
> even when it's full of data :)
> Feedbacks are really really appreciated here. I want to know what you
> think, what are your impressions on the design, tradeoffs, and so
> forth, how it feels when you experiment with it. If you want to see
> the inner workings set log level to "debug".
> The goal is to ship 2.4 ASAP with VM replaced with a good
> implementation of diskstore.
- Is support for loading RDB files planned (when diskstore is enabled)? If so, there will be a start-up delay as the RDB is re-written to the diskstore format, correct?
- Are there any ideas to change the process for starting new slaves, to address some of the existing shortcomings with replication, or do we think that the improved speed and lower memory requirements associated with RDB generation will negate the most serious concerns?
Overall, this looks very exciting, as it opens up even more use cases for Redis.
Thanks, Michael
On Tue, Jan 4, 2011 at 2:57 PM, Salvatore Sanfilippo <anti...@gmail.com>wrote:
> a few months after VM started to work my feeling about it started to > be not very good. I stated in the blog, and privately on IRC, > especially talking with Pieter, that VM was not the way to go for the > future of Redis, and that the new path we were taking about using less > memory was a much better approach. Together with cluster.
> However there are a number of different models for dealing with > datasets bigger than RAM for Redis. Just to cite a few:
> 1) virtual memory, where we swap values on disk as needed (The Redis > Virtual Memory way) > 2) storing data on disk, in a complex form so that operations can be > implemented directly in the on-disk representation, and using the OS > cache as a cache layer for the working set (let's call it the Mongo DB > way) > 3) storing data on disk, but not for direct manipulation, and use > memory as a cache of objects that are active, flushing writes on disks > when this objects change.
> It is now clear that VM is not the right set of tradeoffs, it was > designed to be pretty fast but on the other hand there was a too big > price to pay for all the rest: slow restarts, slow saving, and in turn > slow replication, very complex code, and so forth.
> If you want pure speed with Redis, in memory is the way to go. So as a > reaction to the email sent by Tim about his unhappiness with VM I used > a few vacation days to start implementing a new model, that is was was > listed above as number "3".
> The new set of tradeoffs are very different. The result is called > diskstore, and this is how it works, in a few easy to digest points.
> - In diskstore key-value paris are stored on disk. > - Memory works as a cache for live objects. Operations are only > performed on in memory keys, so data on disk does not need to be > stored in complex forms. > - The cache-max-memory limit is strict. Redis will never use more RAM, > even if we have 2 MB of max memory and 1 billion of keys. This works > since now we don't need to take keys in memory. > - Data is flushed on disk asynchronously. If a key is marked as dirty, > and IO operation is scheduled for this key. > - You can control the delay between modifications of keys and disk > writes, so that if a key is modified a lot of time in small time, it > will written only one time on disk. > - Setting the delay to 0 means, sync it as fast as possible. > - All I/O is performed by a single dedicated thread, that is > long-running and not spawned on demand. The thread is awaked with a > conditional variable. > - The system is much simpler and sane than VM implementation, as there > is no need to "undo" operations on race conditions. > - Zero start-up time... as objects are loaded on demand. > - There is negative caching. If a key is not on disk we remember it > (if there is memory to do so). So we avoid accessing the disk again > and again for keys that are not there. > - The system is very fast if we access mostly our working set, and > this working set happens to fit memory. Otherwise the system is much > slower (I/O bound). > - The system does not support BGSAVE currently, but will support this, > and what is cool, with minimal overhead and used memory in the saving > child, as data on disk is already written using the same serialization > format as .rdb files. So our child will just copy files to obtain the > .rdb. In the mean time the objects in cache are not flushed, so the > system may use more memory, but it's not about copy-on-write, so it > will use very very little additional memory. > - Persistence is *PER KEY* this means, there is no point in time > persistence.
> I think that the above points may give you an idea about how it works. > But let me stress the per-key persistence point a bit.
> LPUSH a 0 > LPUSH b 1 > LPUHS a 2
> So after this commands we may have two IO scheduled operations > pending. One for "a" and one for "b". > Now imagine "a" is saved, and then the server goes down, Redis is > brutally killed, or alike. The database will contain a consistent > version of "a" and "b", but the version of "b" will be old, without > the "1" pushed.
> Also currently MULTI/EXEC is not transactional, but this will be > fixed, at least inside a multi/exec there will be guarantee that > either all values or nothing will be synched to disk (this will be > obtained using a journal file for transactions).
> Some more details. The system is composed of two layers:
> diskstore.c -- implements a trivial on disk key-value store > dscache.c -- implements the more complex caching layer
> diskstore.c is currently a filesystem based KV store. This can be > replaced with a B-TREE or something like that in the future if this > will be needed. However even if the current implementation has a big > overhead, it's pretty cool to have data as files, with very little > chances of loosing data and corruption (rename is used for writes). > But well if this does not scale well enough we'll drop it and replace > it with something better.
> The current implementations is similar to bigdis. 256 directories > containing 256 directories each are used, for a total of 65536 dir. > Every key is put inside the dirs addressed by SHA1(key) translated in > hex, for instance key "foo" is at:
> /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
> The cool thing is, diskstore.c exports a trivial interface to Redis, > so it's very simple to replace with something else without touching > too much internals.
> Stability: the system is obviously in alpha stage, however it works > pretty well, without obvious crashes. But warning, it will crash with > an assert if you try to BGSAVE.
> To try it download the "unstable" branch, edit redis.conf to enable > diskstore. Play with it. Enjoy a redis instance that starts in no time > even when it's full of data :)
> Feedbacks are really really appreciated here. I want to know what you > think, what are your impressions on the design, tradeoffs, and so > forth, how it feels when you experiment with it. If you want to see > the inner workings set log level to "debug".
> The goal is to ship 2.4 ASAP with VM replaced with a good > implementation of diskstore.
> "We are what we repeatedly do. Excellence, therefore, is not an act, > but a habit." -- Aristotele
> -- > You received this message because you are subscribed to the Google Groups > "Redis DB" group. > To post to this group, send email to redis-db@googlegroups.com. > To unsubscribe from this group, send email to > redis-db+unsubscribe@googlegroups.com<redis-db%2Bunsubscribe@googlegroups.c om> > . > For more options, visit this group at > http://groups.google.com/group/redis-db?hl=en.
On Tue, Jan 4, 2011 at 9:49 PM, Andy McCurdy <sed...@gmail.com> wrote: > Hey Salvatore, > I've definitely had issues with VM, and am excited to see how diskstore > works out. In theory, the tradeoffs seem much better. Couple questions.
Hello Andy, thanks for your feedbacks and your questions, they are very interesting.
> What happens if I have a producer (or perhaps many producers) of data that > add dirty data to Redis faster than the Diskstore IO thread can write it? > Specifically, what if the dirty data approaches the max memory limits? Will > the memory footprint grow temporarily until those values are persisted, then > free itself?
Redis will start evicting keys that are not dirty from the cache, but when it eventually runs completely out of memory compared to the max memory limit, it will start blocking waiting for disk I/O to complete so that memory can be reclaimed. Currently this is done not incrementally enough and will be improved, but there are no other solutions.
This basically means that redis will survive without problems to peaks, but if the writes are sustainedly faster than IO eventually there is to wait for IO to compelte :)
But I think it's a good idea that the new semantics is to use cache-max-memory as an *hard* limit. The only reason it Redis will use more memory than that is to process a command that involves using values that are larger than the max memory setting itself (for instance you set maxmemory to 1mb and then do an lpush against a list that in memory is 2mb, but this is not a problem I think).
> What about if I have a large data structure? One with thousands of items in > a list or zset. If this key is written to frequently (multiple times a > second), will the IO thread get backed up by having to write such a large > structure over and over again?
It is mainly up to you, as there is a setting called cache-flush-delay. If you set it to, for instance, 10, then the first time you write against the big zset it will get queued for disk IO, with time for completion set to now+10 seconds. The next writes will not add new writes to the queue, so after 10 seconds the key will be transfered on disk, and then if there is a new operation against this key, it will be queued again. The effect is that the key will be written on disk every 10 seconds if there is a continuos stream of writes against it. So it's a clear trade-off between durability and speed.
I want to point out that for things like leader boards and big sorted sets, lists, ...., the way to go is to use Redis with the default in-memory back end. I guess that in many conditions it will make a lot of sense to run two Redis instances: one for bulk data, and one for fast things like the leader board and other things that must be very fast.
> -andy > On Tue, Jan 4, 2011 at 11:57 AM, Salvatore Sanfilippo <anti...@gmail.com> > wrote:
>> Hi all,
>> a few months after VM started to work my feeling about it started to >> be not very good. I stated in the blog, and privately on IRC, >> especially talking with Pieter, that VM was not the way to go for the >> future of Redis, and that the new path we were taking about using less >> memory was a much better approach. Together with cluster.
>> However there are a number of different models for dealing with >> datasets bigger than RAM for Redis. Just to cite a few:
>> 1) virtual memory, where we swap values on disk as needed (The Redis >> Virtual Memory way) >> 2) storing data on disk, in a complex form so that operations can be >> implemented directly in the on-disk representation, and using the OS >> cache as a cache layer for the working set (let's call it the Mongo DB >> way) >> 3) storing data on disk, but not for direct manipulation, and use >> memory as a cache of objects that are active, flushing writes on disks >> when this objects change.
>> It is now clear that VM is not the right set of tradeoffs, it was >> designed to be pretty fast but on the other hand there was a too big >> price to pay for all the rest: slow restarts, slow saving, and in turn >> slow replication, very complex code, and so forth.
>> If you want pure speed with Redis, in memory is the way to go. So as a >> reaction to the email sent by Tim about his unhappiness with VM I used >> a few vacation days to start implementing a new model, that is was was >> listed above as number "3".
>> The new set of tradeoffs are very different. The result is called >> diskstore, and this is how it works, in a few easy to digest points.
>> - In diskstore key-value paris are stored on disk. >> - Memory works as a cache for live objects. Operations are only >> performed on in memory keys, so data on disk does not need to be >> stored in complex forms. >> - The cache-max-memory limit is strict. Redis will never use more RAM, >> even if we have 2 MB of max memory and 1 billion of keys. This works >> since now we don't need to take keys in memory. >> - Data is flushed on disk asynchronously. If a key is marked as dirty, >> and IO operation is scheduled for this key. >> - You can control the delay between modifications of keys and disk >> writes, so that if a key is modified a lot of time in small time, it >> will written only one time on disk. >> - Setting the delay to 0 means, sync it as fast as possible. >> - All I/O is performed by a single dedicated thread, that is >> long-running and not spawned on demand. The thread is awaked with a >> conditional variable. >> - The system is much simpler and sane than VM implementation, as there >> is no need to "undo" operations on race conditions. >> - Zero start-up time... as objects are loaded on demand. >> - There is negative caching. If a key is not on disk we remember it >> (if there is memory to do so). So we avoid accessing the disk again >> and again for keys that are not there. >> - The system is very fast if we access mostly our working set, and >> this working set happens to fit memory. Otherwise the system is much >> slower (I/O bound). >> - The system does not support BGSAVE currently, but will support this, >> and what is cool, with minimal overhead and used memory in the saving >> child, as data on disk is already written using the same serialization >> format as .rdb files. So our child will just copy files to obtain the >> .rdb. In the mean time the objects in cache are not flushed, so the >> system may use more memory, but it's not about copy-on-write, so it >> will use very very little additional memory. >> - Persistence is *PER KEY* this means, there is no point in time >> persistence.
>> I think that the above points may give you an idea about how it works. >> But let me stress the per-key persistence point a bit.
>> LPUSH a 0 >> LPUSH b 1 >> LPUHS a 2
>> So after this commands we may have two IO scheduled operations >> pending. One for "a" and one for "b". >> Now imagine "a" is saved, and then the server goes down, Redis is >> brutally killed, or alike. The database will contain a consistent >> version of "a" and "b", but the version of "b" will be old, without >> the "1" pushed.
>> Also currently MULTI/EXEC is not transactional, but this will be >> fixed, at least inside a multi/exec there will be guarantee that >> either all values or nothing will be synched to disk (this will be >> obtained using a journal file for transactions).
>> Some more details. The system is composed of two layers:
>> diskstore.c -- implements a trivial on disk key-value store >> dscache.c -- implements the more complex caching layer
>> diskstore.c is currently a filesystem based KV store. This can be >> replaced with a B-TREE or something like that in the future if this >> will be needed. However even if the current implementation has a big >> overhead, it's pretty cool to have data as files, with very little >> chances of loosing data and corruption (rename is used for writes). >> But well if this does not scale well enough we'll drop it and replace >> it with something better.
>> The current implementations is similar to bigdis. 256 directories >> containing 256 directories each are used, for a total of 65536 dir. >> Every key is put inside the dirs addressed by SHA1(key) translated in >> hex, for instance key "foo" is at:
>> The cool thing is, diskstore.c exports a trivial interface to Redis, >> so it's very simple to replace with something else without touching >> too much internals.
>> Stability: the system is obviously in alpha stage, however it works >> pretty well, without obvious crashes. But warning, it will crash with >> an assert if you try to BGSAVE.
>> To try it download the "unstable" branch, edit redis.conf to enable >> diskstore. Play with it. Enjoy a redis instance that starts in no time >> even when it's full of data :)
>> Feedbacks are really really appreciated here. I want to know what you >> think, what are your impressions on the design, tradeoffs, and so >> forth, how it feels when you experiment with it. If you want to see >> the inner workings set log level to "debug".
>> The goal is to ship 2.4 ASAP with VM replaced with a good >> implementation of diskstore.
>> "We are what we repeatedly do. Excellence, therefore, is not an act, >> but a habit." -- Aristotele
>> -- >> You received this message because you are subscribed to the Google Groups >> "Redis DB" group. >> To post to this group, send email to redis-db@googlegroups.com. >> To unsubscribe from this group, send email to >> redis-db+unsubscribe@googlegroups.com. >> For more options, visit this group at >> http://groups.google.com/group/redis-db?hl=en.
> -- > You received this message because you are subscribed to the Google Groups > "Redis DB" group. > To post to this group, send email to redis-db@googlegroups.com. > To unsubscribe from this group, send email to
Salvatore, this sounds great. I told you on twitter about my b-tree implementation, and it's funny that I wrote it after having started my project with almost exactly the same diskstore pattern (although it was 100 sub directories per directory, which is a big difference). It didn't scale well with millions of objects, and made the engine become very slow over time. I assume the difference between 100^2 and 256^2 will make it scale better though. Also, for me most reads were from disk as caches were very small back then, so I'm guessing it's less of an issue if most of the reads are from memory. Back then I eventually wrote a b-tree based storage engine that worked much better with millions of varying size, growing blobs (inverted indexes mainly).
Another question is - can diskstore (optionally of course) totally replace RDBs? besides the fact that backups will be slower, why not allowing users to use diskstore as the only store in redis?
On Tue, Jan 4, 2011 at 9:57 PM, Salvatore Sanfilippo <anti...@gmail.com>wrote:
> a few months after VM started to work my feeling about it started to > be not very good. I stated in the blog, and privately on IRC, > especially talking with Pieter, that VM was not the way to go for the > future of Redis, and that the new path we were taking about using less > memory was a much better approach. Together with cluster.
> However there are a number of different models for dealing with > datasets bigger than RAM for Redis. Just to cite a few:
> 1) virtual memory, where we swap values on disk as needed (The Redis > Virtual Memory way) > 2) storing data on disk, in a complex form so that operations can be > implemented directly in the on-disk representation, and using the OS > cache as a cache layer for the working set (let's call it the Mongo DB > way) > 3) storing data on disk, but not for direct manipulation, and use > memory as a cache of objects that are active, flushing writes on disks > when this objects change.
> It is now clear that VM is not the right set of tradeoffs, it was > designed to be pretty fast but on the other hand there was a too big > price to pay for all the rest: slow restarts, slow saving, and in turn > slow replication, very complex code, and so forth.
> If you want pure speed with Redis, in memory is the way to go. So as a > reaction to the email sent by Tim about his unhappiness with VM I used > a few vacation days to start implementing a new model, that is was was > listed above as number "3".
> The new set of tradeoffs are very different. The result is called > diskstore, and this is how it works, in a few easy to digest points.
> - In diskstore key-value paris are stored on disk. > - Memory works as a cache for live objects. Operations are only > performed on in memory keys, so data on disk does not need to be > stored in complex forms. > - The cache-max-memory limit is strict. Redis will never use more RAM, > even if we have 2 MB of max memory and 1 billion of keys. This works > since now we don't need to take keys in memory. > - Data is flushed on disk asynchronously. If a key is marked as dirty, > and IO operation is scheduled for this key. > - You can control the delay between modifications of keys and disk > writes, so that if a key is modified a lot of time in small time, it > will written only one time on disk. > - Setting the delay to 0 means, sync it as fast as possible. > - All I/O is performed by a single dedicated thread, that is > long-running and not spawned on demand. The thread is awaked with a > conditional variable. > - The system is much simpler and sane than VM implementation, as there > is no need to "undo" operations on race conditions. > - Zero start-up time... as objects are loaded on demand. > - There is negative caching. If a key is not on disk we remember it > (if there is memory to do so). So we avoid accessing the disk again > and again for keys that are not there. > - The system is very fast if we access mostly our working set, and > this working set happens to fit memory. Otherwise the system is much > slower (I/O bound). > - The system does not support BGSAVE currently, but will support this, > and what is cool, with minimal overhead and used memory in the saving > child, as data on disk is already written using the same serialization > format as .rdb files. So our child will just copy files to obtain the > .rdb. In the mean time the objects in cache are not flushed, so the > system may use more memory, but it's not about copy-on-write, so it > will use very very little additional memory. > - Persistence is *PER KEY* this means, there is no point in time > persistence.
> I think that the above points may give you an idea about how it works. > But let me stress the per-key persistence point a bit.
> LPUSH a 0 > LPUSH b 1 > LPUHS a 2
> So after this commands we may have two IO scheduled operations > pending. One for "a" and one for "b". > Now imagine "a" is saved, and then the server goes down, Redis is > brutally killed, or alike. The database will contain a consistent > version of "a" and "b", but the version of "b" will be old, without > the "1" pushed.
> Also currently MULTI/EXEC is not transactional, but this will be > fixed, at least inside a multi/exec there will be guarantee that > either all values or nothing will be synched to disk (this will be > obtained using a journal file for transactions).
> Some more details. The system is composed of two layers:
> diskstore.c -- implements a trivial on disk key-value store > dscache.c -- implements the more complex caching layer
> diskstore.c is currently a filesystem based KV store. This can be > replaced with a B-TREE or something like that in the future if this > will be needed. However even if the current implementation has a big > overhead, it's pretty cool to have data as files, with very little > chances of loosing data and corruption (rename is used for writes). > But well if this does not scale well enough we'll drop it and replace > it with something better.
> The current implementations is similar to bigdis. 256 directories > containing 256 directories each are used, for a total of 65536 dir. > Every key is put inside the dirs addressed by SHA1(key) translated in > hex, for instance key "foo" is at:
> /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
> The cool thing is, diskstore.c exports a trivial interface to Redis, > so it's very simple to replace with something else without touching > too much internals.
> Stability: the system is obviously in alpha stage, however it works > pretty well, without obvious crashes. But warning, it will crash with > an assert if you try to BGSAVE.
> To try it download the "unstable" branch, edit redis.conf to enable > diskstore. Play with it. Enjoy a redis instance that starts in no time > even when it's full of data :)
> Feedbacks are really really appreciated here. I want to know what you > think, what are your impressions on the design, tradeoffs, and so > forth, how it feels when you experiment with it. If you want to see > the inner workings set log level to "debug".
> The goal is to ship 2.4 ASAP with VM replaced with a good > implementation of diskstore.
> "We are what we repeatedly do. Excellence, therefore, is not an act, > but a habit." -- Aristotele
> -- > You received this message because you are subscribed to the Google Groups > "Redis DB" group. > To post to this group, send email to redis-db@googlegroups.com. > To unsubscribe from this group, send email to > redis-db+unsubscribe@googlegroups.com<redis-db%2Bunsubscribe@googlegroups.c om> > . > For more options, visit this group at > http://groups.google.com/group/redis-db?hl=en.
On Tue, Jan 4, 2011 at 9:50 PM, andrychris <ch...@noneofyo.biz> wrote:
Thank you for your comments Chris! Replying below.
> * Once redis reaches the cache RAM limit, the machine starts thrashing > pretty bad, and ends up spending a good deal of time waiting for disk > i/o and generally slowing redis to a halt. The SETs per second at > that point quickly drops of from about 25-26k to 10k, and then to > nothing as it thrashes back and forth.
Yes, this is expected behavior but for the fact that currently when we reach the condition that there is no memory to create a longer write queue the whole current queue is processed in a blocking way. Instead I need to make this incremental. It's very easy, just the current implementation was a two minutes effort to make it working without surpassing the memory limit. But whatever we can do of course when there are too many writes we'll start waiting for I/O.
> * I haven't spent as much time testing this out, but increasing cache- > flush-delay seems to decrease the "hit" that occurs once we reach the > RAM limit, but that success is short lived :)
Increasing cache-flush-delay and cache-max-memory helps in two different ways:
1) increasing cache-flush-delay helps a lot if you write again and again again 2) increasing cache-max-memory helps to handle well a peak that will last for more time.
But if the write-heavy workload continues for too long in the end performances will start to get lower and lower, until they'll match the performance of the I/O subsystem.
> I will freely admit that my test is a little unfair at best, and > unrealistic at worst -- obviously in a production scenario I wouldn't > be using redis with a disk backing store simply to constantly issue > SET requests. However, I've been checking out various options to do > more or less exactly what the future redis v2.4 aims to do, and "what > happens when you run out of RAM" is a question that I need to answer, > and the answer, in most cases, seems to be "noooooo!!!!!!!" :). Not > entirely surprising given the stress I'm putting on it, but I need to > know what happens in worst case scenarios.
Completely agree with your vision, the worst case scenario must be very clear. And of course we can improve the I/O performance itself.
For instance, serializing / deserializing hashes / lists / sets is slow, but now that we have specially encoded data type that are... guess what? A linear array of data! We can serialize it on disk just by copying.
Many optimizations are possible.
> That being said, while having something that magically experienced > zero performance degradation when writing to disk instead of RAM would > be just lovely, I'm not gonna hold my breath :). I am looking for > consistent and predictable performance characteristics, however. At > the moment, redis effectively locks up if it's under heavy load, has > reached the cache size limit, and the db is continuing to grow.
Yes now it consumes the whole queue of 200 writes, every time... not exactly a good idea. Will fix it soon.
> I'll be continuing some of my testing in the near future to include > some scenarios that are a bit more realistic, mixing GET/SET/etc. > requests in certain ratios and so forth, using things that are at > least closer to "real world" data (for us, anyway).
Great
> Depending on how far I get, I might offer up a branch with a few > tweaks of my own, but that depends largely on how much time I can > devote to it in the near future :).
Thanks
> To close, I'll just say that for being as new and early stage as the > branch formerly known as 'diskstore' is, it's looking pretty good. > Thanks for all the hard work!
> On Jan 4, 2:57 pm, Salvatore Sanfilippo <anti...@gmail.com> wrote: >> Hi all,
>> a few months after VM started to work my feeling about it started to >> be not very good. I stated in the blog, and privately on IRC, >> especially talking with Pieter, that VM was not the way to go for the >> future of Redis, and that the new path we were taking about using less >> memory was a much better approach. Together with cluster.
>> However there are a number of different models for dealing with >> datasets bigger than RAM for Redis. Just to cite a few:
>> 1) virtual memory, where we swap values on disk as needed (The Redis >> Virtual Memory way) >> 2) storing data on disk, in a complex form so that operations can be >> implemented directly in the on-disk representation, and using the OS >> cache as a cache layer for the working set (let's call it the Mongo DB >> way) >> 3) storing data on disk, but not for direct manipulation, and use >> memory as a cache of objects that are active, flushing writes on disks >> when this objects change.
>> It is now clear that VM is not the right set of tradeoffs, it was >> designed to be pretty fast but on the other hand there was a too big >> price to pay for all the rest: slow restarts, slow saving, and in turn >> slow replication, very complex code, and so forth.
>> If you want pure speed with Redis, in memory is the way to go. So as a >> reaction to the email sent by Tim about his unhappiness with VM I used >> a few vacation days to start implementing a new model, that is was was >> listed above as number "3".
>> The new set of tradeoffs are very different. The result is called >> diskstore, and this is how it works, in a few easy to digest points.
>> - In diskstore key-value paris are stored on disk. >> - Memory works as a cache for live objects. Operations are only >> performed on in memory keys, so data on disk does not need to be >> stored in complex forms. >> - The cache-max-memory limit is strict. Redis will never use more RAM, >> even if we have 2 MB of max memory and 1 billion of keys. This works >> since now we don't need to take keys in memory. >> - Data is flushed on disk asynchronously. If a key is marked as dirty, >> and IO operation is scheduled for this key. >> - You can control the delay between modifications of keys and disk >> writes, so that if a key is modified a lot of time in small time, it >> will written only one time on disk. >> - Setting the delay to 0 means, sync it as fast as possible. >> - All I/O is performed by a single dedicated thread, that is >> long-running and not spawned on demand. The thread is awaked with a >> conditional variable. >> - The system is much simpler and sane than VM implementation, as there >> is no need to "undo" operations on race conditions. >> - Zero start-up time... as objects are loaded on demand. >> - There is negative caching. If a key is not on disk we remember it >> (if there is memory to do so). So we avoid accessing the disk again >> and again for keys that are not there. >> - The system is very fast if we access mostly our working set, and >> this working set happens to fit memory. Otherwise the system is much >> slower (I/O bound). >> - The system does not support BGSAVE currently, but will support this, >> and what is cool, with minimal overhead and used memory in the saving >> child, as data on disk is already written using the same serialization >> format as .rdb files. So our child will just copy files to obtain the >> .rdb. In the mean time the objects in cache are not flushed, so the >> system may use more memory, but it's not about copy-on-write, so it >> will use very very little additional memory. >> - Persistence is *PER KEY* this means, there is no point in time persistence.
>> I think that the above points may give you an idea about how it works. >> But let me stress the per-key persistence point a bit.
>> LPUSH a 0 >> LPUSH b 1 >> LPUHS a 2
>> So after this commands we may have two IO scheduled operations >> pending. One for "a" and one for "b". >> Now imagine "a" is saved, and then the server goes down, Redis is >> brutally killed, or alike. The database will contain a consistent >> version of "a" and "b", but the version of "b" will be old, without >> the "1" pushed.
>> Also currently MULTI/EXEC is not transactional, but this will be >> fixed, at least inside a multi/exec there will be guarantee that >> either all values or nothing will be synched to disk (this will be >> obtained using a journal file for transactions).
>> Some more details. The system is composed of two layers:
>> diskstore.c -- implements a trivial on disk key-value store >> dscache.c -- implements the more complex caching layer
>> diskstore.c is currently a filesystem based KV store. This can be >> replaced with a B-TREE or something like that in the future if this >> will be needed. However even if the current implementation has a big >> overhead, it's pretty cool to have data as files, with very little >> chances of loosing data and corruption (rename is used for writes). >> But well if this does not scale well enough we'll drop it and replace >> it with something better.
>> The current implementations is similar to bigdis. 256 directories >> containing 256 directories each are used, for a total of 65536 dir. >> Every key is put inside the dirs addressed by SHA1(key) translated in >> hex, for instance key "foo" is at:
>> The cool thing is, diskstore.c exports a trivial interface to Redis, >> so it's very simple to replace with something else without touching >> too much internals.
>> Stability: the system is obviously in alpha stage, however it works >> pretty well, without obvious crashes. But warning, it will crash with >> an assert if you try to BGSAVE.
>> To try it download the "unstable" branch, edit redis.conf to enable >> diskstore. Play with it. Enjoy a redis instance that starts in no time >> even when it's full of data :)
>> Feedbacks are really really appreciated here. I want to know what you >> think, what are your impressions on the design, tradeoffs, and so >> forth, how it feels when you experiment with it. If you want to
On Tue, Jan 4, 2011 at 10:24 PM, Michael Russo <mjru...@gmail.com> wrote: > A few questions: > - Is support for loading RDB files planned (when diskstore is enabled)? If > so, there will be a start-up delay as the RDB is re-written to the diskstore > format, correct?
The idea is to provide a tool to convert an .rdb file into diskstore format. The contrary will be handled natively by redis, just calling BGSAVE.
> - Are there any ideas to change the process for starting new slaves, to > address some of the existing shortcomings with replication, or do we think > that the improved speed and lower memory requirements associated with RDB > generation will negate the most serious concerns?
I think that BGSAVE will be a very efficient operation with diskstore, so will not create the problems that created when VM was active. We'll see how the actual implementation works, but I'm very positive about this.
> Overall, this looks very exciting, as it opens up even more use cases for > Redis.
> Thanks, > Michael > On Tue, Jan 4, 2011 at 2:57 PM, Salvatore Sanfilippo <anti...@gmail.com> > wrote:
>> Hi all,
>> a few months after VM started to work my feeling about it started to >> be not very good. I stated in the blog, and privately on IRC, >> especially talking with Pieter, that VM was not the way to go for the >> future of Redis, and that the new path we were taking about using less >> memory was a much better approach. Together with cluster.
>> However there are a number of different models for dealing with >> datasets bigger than RAM for Redis. Just to cite a few:
>> 1) virtual memory, where we swap values on disk as needed (The Redis >> Virtual Memory way) >> 2) storing data on disk, in a complex form so that operations can be >> implemented directly in the on-disk representation, and using the OS >> cache as a cache layer for the working set (let's call it the Mongo DB >> way) >> 3) storing data on disk, but not for direct manipulation, and use >> memory as a cache of objects that are active, flushing writes on disks >> when this objects change.
>> It is now clear that VM is not the right set of tradeoffs, it was >> designed to be pretty fast but on the other hand there was a too big >> price to pay for all the rest: slow restarts, slow saving, and in turn >> slow replication, very complex code, and so forth.
>> If you want pure speed with Redis, in memory is the way to go. So as a >> reaction to the email sent by Tim about his unhappiness with VM I used >> a few vacation days to start implementing a new model, that is was was >> listed above as number "3".
>> The new set of tradeoffs are very different. The result is called >> diskstore, and this is how it works, in a few easy to digest points.
>> - In diskstore key-value paris are stored on disk. >> - Memory works as a cache for live objects. Operations are only >> performed on in memory keys, so data on disk does not need to be >> stored in complex forms. >> - The cache-max-memory limit is strict. Redis will never use more RAM, >> even if we have 2 MB of max memory and 1 billion of keys. This works >> since now we don't need to take keys in memory. >> - Data is flushed on disk asynchronously. If a key is marked as dirty, >> and IO operation is scheduled for this key. >> - You can control the delay between modifications of keys and disk >> writes, so that if a key is modified a lot of time in small time, it >> will written only one time on disk. >> - Setting the delay to 0 means, sync it as fast as possible. >> - All I/O is performed by a single dedicated thread, that is >> long-running and not spawned on demand. The thread is awaked with a >> conditional variable. >> - The system is much simpler and sane than VM implementation, as there >> is no need to "undo" operations on race conditions. >> - Zero start-up time... as objects are loaded on demand. >> - There is negative caching. If a key is not on disk we remember it >> (if there is memory to do so). So we avoid accessing the disk again >> and again for keys that are not there. >> - The system is very fast if we access mostly our working set, and >> this working set happens to fit memory. Otherwise the system is much >> slower (I/O bound). >> - The system does not support BGSAVE currently, but will support this, >> and what is cool, with minimal overhead and used memory in the saving >> child, as data on disk is already written using the same serialization >> format as .rdb files. So our child will just copy files to obtain the >> .rdb. In the mean time the objects in cache are not flushed, so the >> system may use more memory, but it's not about copy-on-write, so it >> will use very very little additional memory. >> - Persistence is *PER KEY* this means, there is no point in time >> persistence.
>> I think that the above points may give you an idea about how it works. >> But let me stress the per-key persistence point a bit.
>> LPUSH a 0 >> LPUSH b 1 >> LPUHS a 2
>> So after this commands we may have two IO scheduled operations >> pending. One for "a" and one for "b". >> Now imagine "a" is saved, and then the server goes down, Redis is >> brutally killed, or alike. The database will contain a consistent >> version of "a" and "b", but the version of "b" will be old, without >> the "1" pushed.
>> Also currently MULTI/EXEC is not transactional, but this will be >> fixed, at least inside a multi/exec there will be guarantee that >> either all values or nothing will be synched to disk (this will be >> obtained using a journal file for transactions).
>> Some more details. The system is composed of two layers:
>> diskstore.c -- implements a trivial on disk key-value store >> dscache.c -- implements the more complex caching layer
>> diskstore.c is currently a filesystem based KV store. This can be >> replaced with a B-TREE or something like that in the future if this >> will be needed. However even if the current implementation has a big >> overhead, it's pretty cool to have data as files, with very little >> chances of loosing data and corruption (rename is used for writes). >> But well if this does not scale well enough we'll drop it and replace >> it with something better.
>> The current implementations is similar to bigdis. 256 directories >> containing 256 directories each are used, for a total of 65536 dir. >> Every key is put inside the dirs addressed by SHA1(key) translated in >> hex, for instance key "foo" is at:
>> The cool thing is, diskstore.c exports a trivial interface to Redis, >> so it's very simple to replace with something else without touching >> too much internals.
>> Stability: the system is obviously in alpha stage, however it works >> pretty well, without obvious crashes. But warning, it will crash with >> an assert if you try to BGSAVE.
>> To try it download the "unstable" branch, edit redis.conf to enable >> diskstore. Play with it. Enjoy a redis instance that starts in no time >> even when it's full of data :)
>> Feedbacks are really really appreciated here. I want to know what you >> think, what are your impressions on the design, tradeoffs, and so >> forth, how it feels when you experiment with it. If you want to see >> the inner workings set log level to "debug".
>> The goal is to ship 2.4 ASAP with VM replaced with a good >> implementation of diskstore.
>> "We are what we repeatedly do. Excellence, therefore, is not an act, >> but a habit." -- Aristotele
>> -- >> You received this message because you are subscribed to the Google Groups >> "Redis DB" group. >> To post to this group, send email to redis-db@googlegroups.com. >> To unsubscribe from this group, send email to >> redis-db+unsubscribe@googlegroups.com. >> For more options, visit this group at >> http://groups.google.com/group/redis-db?hl=en.
> -- > You received this message because you are subscribed to the Google Groups > "Redis DB" group. > To post to this group, send email to redis-db@googlegroups.com. > To unsubscribe from this group, send email to > redis-db+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/redis-db?hl=en.
On Tue, Jan 4, 2011 at 10:36 PM, Dvir Volk <dvir...@gmail.com> wrote: > Salvatore, this sounds great. > I told you on twitter about my b-tree implementation, and it's funny that I > wrote it after having started my project with almost exactly the same > diskstore pattern (although it was 100 sub directories per directory, which > is a big difference).
Dvir right this is very interesting, 100 or 256 or ... 1000?
This was my reasoning. I want it to work well with 1 billion keys. So:
1000000000/(256*256) = 15258
15k files per dir are more or less near to the limit to get good performances. On the other side to create 65536 directory, or to scan 65536 directory for BGSAVE, seems acceptable, and works pretty fast in practice.
> It didn't scale well with millions of objects, and made the engine become > very slow over time. I assume the difference between 100^2 and 256^2 will > make it scale better though. Also, for me most reads were from disk as > caches were very small back then, so I'm guessing it's less of an issue if > most of the reads are from memory.
An important variable here is the filesystem in use for sure. Btw there is no reason to don't try with a B-TREE, given the interface is pretty abstract. We should give it a try at least to get the feeling about what other solutions can provide us.
> Back then I eventually wrote a b-tree based storage engine that worked much > better with millions of varying size, growing blobs (inverted indexes > mainly).
Cool. If you want to hack a patch it can be an interesting experiment. In the end we'll need something able to auto compact itself if we'll select the B-TREE approach in the end.
> Another question is - can diskstore (optionally of course) totally replace > RDBs? besides the fact that backups will be slower, why not allowing users > to use diskstore as the only store in redis?
It's already this way. you can *optionally* call BGSAVE when diskstore is active (or better you will ... still not implemented), but there is no need for it. BGSAVE will also be called for replication if our diskstore instance is a master, but our diskstore-BGSAVE should work very well.
I'll try implementing it tomorrow so'll know very soon :)
> On Tue, Jan 4, 2011 at 9:57 PM, Salvatore Sanfilippo <anti...@gmail.com> > wrote:
>> Hi all,
>> a few months after VM started to work my feeling about it started to >> be not very good. I stated in the blog, and privately on IRC, >> especially talking with Pieter, that VM was not the way to go for the >> future of Redis, and that the new path we were taking about using less >> memory was a much better approach. Together with cluster.
>> However there are a number of different models for dealing with >> datasets bigger than RAM for Redis. Just to cite a few:
>> 1) virtual memory, where we swap values on disk as needed (The Redis >> Virtual Memory way) >> 2) storing data on disk, in a complex form so that operations can be >> implemented directly in the on-disk representation, and using the OS >> cache as a cache layer for the working set (let's call it the Mongo DB >> way) >> 3) storing data on disk, but not for direct manipulation, and use >> memory as a cache of objects that are active, flushing writes on disks >> when this objects change.
>> It is now clear that VM is not the right set of tradeoffs, it was >> designed to be pretty fast but on the other hand there was a too big >> price to pay for all the rest: slow restarts, slow saving, and in turn >> slow replication, very complex code, and so forth.
>> If you want pure speed with Redis, in memory is the way to go. So as a >> reaction to the email sent by Tim about his unhappiness with VM I used >> a few vacation days to start implementing a new model, that is was was >> listed above as number "3".
>> The new set of tradeoffs are very different. The result is called >> diskstore, and this is how it works, in a few easy to digest points.
>> - In diskstore key-value paris are stored on disk. >> - Memory works as a cache for live objects. Operations are only >> performed on in memory keys, so data on disk does not need to be >> stored in complex forms. >> - The cache-max-memory limit is strict. Redis will never use more RAM, >> even if we have 2 MB of max memory and 1 billion of keys. This works >> since now we don't need to take keys in memory. >> - Data is flushed on disk asynchronously. If a key is marked as dirty, >> and IO operation is scheduled for this key. >> - You can control the delay between modifications of keys and disk >> writes, so that if a key is modified a lot of time in small time, it >> will written only one time on disk. >> - Setting the delay to 0 means, sync it as fast as possible. >> - All I/O is performed by a single dedicated thread, that is >> long-running and not spawned on demand. The thread is awaked with a >> conditional variable. >> - The system is much simpler and sane than VM implementation, as there >> is no need to "undo" operations on race conditions. >> - Zero start-up time... as objects are loaded on demand. >> - There is negative caching. If a key is not on disk we remember it >> (if there is memory to do so). So we avoid accessing the disk again >> and again for keys that are not there. >> - The system is very fast if we access mostly our working set, and >> this working set happens to fit memory. Otherwise the system is much >> slower (I/O bound). >> - The system does not support BGSAVE currently, but will support this, >> and what is cool, with minimal overhead and used memory in the saving >> child, as data on disk is already written using the same serialization >> format as .rdb files. So our child will just copy files to obtain the >> .rdb. In the mean time the objects in cache are not flushed, so the >> system may use more memory, but it's not about copy-on-write, so it >> will use very very little additional memory. >> - Persistence is *PER KEY* this means, there is no point in time >> persistence.
>> I think that the above points may give you an idea about how it works. >> But let me stress the per-key persistence point a bit.
>> LPUSH a 0 >> LPUSH b 1 >> LPUHS a 2
>> So after this commands we may have two IO scheduled operations >> pending. One for "a" and one for "b". >> Now imagine "a" is saved, and then the server goes down, Redis is >> brutally killed, or alike. The database will contain a consistent >> version of "a" and "b", but the version of "b" will be old, without >> the "1" pushed.
>> Also currently MULTI/EXEC is not transactional, but this will be >> fixed, at least inside a multi/exec there will be guarantee that >> either all values or nothing will be synched to disk (this will be >> obtained using a journal file for transactions).
>> Some more details. The system is composed of two layers:
>> diskstore.c -- implements a trivial on disk key-value store >> dscache.c -- implements the more complex caching layer
>> diskstore.c is currently a filesystem based KV store. This can be >> replaced with a B-TREE or something like that in the future if this >> will be needed. However even if the current implementation has a big >> overhead, it's pretty cool to have data as files, with very little >> chances of loosing data and corruption (rename is used for writes). >> But well if this does not scale well enough we'll drop it and replace >> it with something better.
>> The current implementations is similar to bigdis. 256 directories >> containing 256 directories each are used, for a total of 65536 dir. >> Every key is put inside the dirs addressed by SHA1(key) translated in >> hex, for instance key "foo" is at:
>> The cool thing is, diskstore.c exports a trivial interface to Redis, >> so it's very simple to replace with something else without touching >> too much internals.
>> Stability: the system is obviously in alpha stage, however it works >> pretty well, without obvious crashes. But warning, it will crash with >> an assert if you try to BGSAVE.
>> To try it download the "unstable" branch, edit redis.conf to enable >> diskstore. Play with it. Enjoy a redis instance that starts in no time >> even when it's full of data :)
>> Feedbacks are really really appreciated here. I want to know what you >> think, what are your impressions on the design, tradeoffs, and so >> forth, how it feels when you experiment with it. If you want to see >> the inner workings set log level to "debug".
>> The goal is to ship 2.4 ASAP with VM replaced with a good >> implementation of diskstore.
>> "We are what we repeatedly do. Excellence, therefore, is not an act, >> but a habit." -- Aristotele
>> -- >> You received this message because you are subscribed to the Google Groups >> "Redis DB" group. >> To post to this group, send email to redis-db@googlegroups.com. >> To unsubscribe from this group, send email to >> redis-db+unsubscribe@googlegroups.com. >> For more options, visit this group at >> http://groups.google.com/group/redis-db?hl=en.
> -- > You received this message because you are subscribed to the Google Groups > "Redis DB" group. > To post to this group, send email to redis-db@googlegroups.com. > To unsubscribe from this group, send email to > redis-db+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/redis-db?hl=en.
This is really cool. Most dataset in practice has a % of items as "hot". This approach makes redis scale UP well. For example, we plan to use a Dell SSD box to replace regular hard disk in production, this will definitely help throughput I think :-)
> - Is support for loading RDB files planned (when diskstore is enabled)? > > If so, there will be a start-up delay as the RDB is re-written to the
> diskstore format, correct?
> The idea is to provide a tool to convert an .rdb file into diskstore > format. > The contrary will be handled natively by redis, just calling BGSAVE.
On question here "the contrary" means convert diskstore into rdb format? If so, why we still need it after the adoption of diskstore?
What is the way to do BACKUP for diskstore ? Just copy / rsync the entire directory tree I hope?
> If you want pure speed with Redis, in memory is the way to go. So as a > reaction to the email sent by Tim about his unhappiness with VM I used > a few vacation days to start implementing a new model, that is was was > listed above as number "3".
Pardon my ignorance, but I am curious as to if this is similar to the way MongoDB works (it used memory-mapped files and fsyncs at regular intervals).
If not, what would be the disadvantages if Redis adopted such a scheme?
On Tue, Jan 4, 2011 at 10:43 PM, Baishampayan Ghose <b.gh...@gmail.com>wrote:
> > If you want pure speed with Redis, in memory is the way to go. So as a > > reaction to the email sent by Tim about his unhappiness with VM I used > > a few vacation days to start implementing a new model, that is was was > > listed above as number "3".
> Pardon my ignorance, but I am curious as to if this is similar to the > way MongoDB works (it used memory-mapped files and fsyncs at regular > intervals).
It is not the same. MongoDB uses large files that hold both data and indexes that are memory-mapped, and thus are limited to have databases at most the architecture limit of your platform. That is, only up to about 3GB on a 32 bit machine. The method described by Salvatore stores the value pointed to by each base key in it's own file on disk, reading and (re-)writing them as necessary.
If not, what would be the disadvantages if Redis adopted such a scheme?
Given the current description of Redis diskstore, you could specify a 3 gig cache limit, but store an unlimited amount of data on disk. As long as your working-set tended below 3 gig, you would generally have very good performance. If a MongoDB-based scheme was used, you would be limited to 3GB of total data size, AND your on-disk representation must be the same as your in-memory representation (the on-disk representation for Redis structures tends to be smaller than their in-memory representation).
Thank you for taking the time and effort to design and build this system. There is one concern that have about this particular implementation, and a potential solution.
Depending on the load, some people will see a drastic reduction in IO performance when writing many different keys and files, even more so than just the random IO would suggest. Why? Filesystem metadata. Every time a file is written to, it's size, modification date, etc., all need to be updated. On a modern transactional filesystem (which we should all be using by now), this is done by first allocating the data extent, writing the data to disk, writing to a journal about how the metadata will change, then writing the actual metadata to the inode(s). This is necessary for all writes, and we have to deal with it. However, as the number of files in an individual path increase, the efficiency of the inode that describes that path decreases, and the amount of time taken to update that particular file increases. Probably not a big deal for some with a million or so keys (only 16 files on average in each directory, with an expectation of some having up to 32).
However, what I'm taking from this design description is that the desire is for Redis to be usable in systems where main memory is significantly smaller than problem sizes, so looking at a hundred million keys is not unreasonable, which pushes this to 16k files in each path on average, peaking to 32k expected.
Here's an alternate design which is similar, but which should scale out better due to a 1) reduced number of files, 2) fixed number of inode entries, 3) potential higher write throughput by aggregating more data, and 4) a method that lends itself to round-robin flushing/rewriting (you'll see why this is important later): * Data path has 256 pre-sharded hex paths (like the first-level of the described disk-store method). * Instead of up to 256 sub-directories of the higher-level 256 directories, we instead have 256 AOF files, which represent the data read/written to the 64k shards. * Handle all writes/caching exactly the same as using the described diskstore, only now there are chunks of data that are read/written together when flushed. * A 64k entry table can be used to keep track of the last time an AOF chunk was reprocessed, whether there is any dirty data, etc. * Every X seconds, a small table can be flushed to disk to describe the cache state, last modified state, etc., of shards in the system. This would allow for Redis to pre-cache before requests start coming in.
Also, because the AOFs are pre-sharded 64k ways, re-writing each of them individually to be compact should be fast.
The major drawback to this system is that if you miss on a key, you have to read out the entire chunk, which could have large unrelated data contained therein. For example, I want to the integer stored in key "foo", but there are a half-dozen 5 million entry zsets that in the same shard, and that shard isn't in cache... ouch. *
It's just some ideas. Feel free to discard it, consider it, whatever.
Regards, - Josiah
* This particular issue can be worked-around by having "automatic" and "user-defined" shard ids. If a key does not contain the "{...}" string, it is automatically sharded into the 64k shards via the 2 byte prefix of the sha1 hash. If it contains a "{....}" string, and the "...." is exactly 4 hex digits, then that is the shard-id (if it's not exactly 4 hex digits, then just that portion is the key to be passed into sha1). To handle the "but I don't want my integer value to collide with my giant zsets" issue, the "run a second Redis with a different data path" answer should suffice.
On Tue, Jan 4, 2011 at 11:57 AM, Salvatore Sanfilippo <anti...@gmail.com>wrote:
> a few months after VM started to work my feeling about it started to > be not very good. I stated in the blog, and privately on IRC, > especially talking with Pieter, that VM was not the way to go for the > future of Redis, and that the new path we were taking about using less > memory was a much better approach. Together with cluster.
> However there are a number of different models for dealing with > datasets bigger than RAM for Redis. Just to cite a few:
> 1) virtual memory, where we swap values on disk as needed (The Redis > Virtual Memory way) > 2) storing data on disk, in a complex form so that operations can be > implemented directly in the on-disk representation, and using the OS > cache as a cache layer for the working set (let's call it the Mongo DB > way) > 3) storing data on disk, but not for direct manipulation, and use > memory as a cache of objects that are active, flushing writes on disks > when this objects change.
> It is now clear that VM is not the right set of tradeoffs, it was > designed to be pretty fast but on the other hand there was a too big > price to pay for all the rest: slow restarts, slow saving, and in turn > slow replication, very complex code, and so forth.
> If you want pure speed with Redis, in memory is the way to go. So as a > reaction to the email sent by Tim about his unhappiness with VM I used > a few vacation days to start implementing a new model, that is was was > listed above as number "3".
> The new set of tradeoffs are very different. The result is called > diskstore, and this is how it works, in a few easy to digest points.
> - In diskstore key-value paris are stored on disk. > - Memory works as a cache for live objects. Operations are only > performed on in memory keys, so data on disk does not need to be > stored in complex forms. > - The cache-max-memory limit is strict. Redis will never use more RAM, > even if we have 2 MB of max memory and 1 billion of keys. This works > since now we don't need to take keys in memory. > - Data is flushed on disk asynchronously. If a key is marked as dirty, > and IO operation is scheduled for this key. > - You can control the delay between modifications of keys and disk > writes, so that if a key is modified a lot of time in small time, it > will written only one time on disk. > - Setting the delay to 0 means, sync it as fast as possible. > - All I/O is performed by a single dedicated thread, that is > long-running and not spawned on demand. The thread is awaked with a > conditional variable. > - The system is much simpler and sane than VM implementation, as there > is no need to "undo" operations on race conditions. > - Zero start-up time... as objects are loaded on demand. > - There is negative caching. If a key is not on disk we remember it > (if there is memory to do so). So we avoid accessing the disk again > and again for keys that are not there. > - The system is very fast if we access mostly our working set, and > this working set happens to fit memory. Otherwise the system is much > slower (I/O bound). > - The system does not support BGSAVE currently, but will support this, > and what is cool, with minimal overhead and used memory in the saving > child, as data on disk is already written using the same serialization > format as .rdb files. So our child will just copy files to obtain the > .rdb. In the mean time the objects in cache are not flushed, so the > system may use more memory, but it's not about copy-on-write, so it > will use very very little additional memory. > - Persistence is *PER KEY* this means, there is no point in time > persistence.
> I think that the above points may give you an idea about how it works. > But let me stress the per-key persistence point a bit.
> LPUSH a 0 > LPUSH b 1 > LPUHS a 2
> So after this commands we may have two IO scheduled operations > pending. One for "a" and one for "b". > Now imagine "a" is saved, and then the server goes down, Redis is > brutally killed, or alike. The database will contain a consistent > version of "a" and "b", but the version of "b" will be old, without > the "1" pushed.
> Also currently MULTI/EXEC is not transactional, but this will be > fixed, at least inside a multi/exec there will be guarantee that > either all values or nothing will be synched to disk (this will be > obtained using a journal file for transactions).
> Some more details. The system is composed of two layers:
> diskstore.c -- implements a trivial on disk key-value store > dscache.c -- implements the more complex caching layer
> diskstore.c is currently a filesystem based KV store. This can be > replaced with a B-TREE or something like that in the future if this > will be needed. However even if the current implementation has a big > overhead, it's pretty cool to have data as files, with very little > chances of loosing data and corruption (rename is used for writes). > But well if this does not scale well enough we'll drop it and replace > it with something better.
> The current implementations is similar to bigdis. 256 directories > containing 256 directories each are used, for a total of 65536 dir. > Every key is put inside the dirs addressed by SHA1(key) translated in > hex, for instance key "foo" is at:
> /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
> The cool thing is, diskstore.c exports a trivial interface to Redis, > so it's very simple to replace with something else without touching > too much internals.
> Stability: the system is obviously in alpha stage, however it works > pretty well, without obvious crashes. But warning, it will crash with > an assert if you try to BGSAVE.
> To try it download the "unstable" branch, edit redis.conf to enable > diskstore. Play with it. Enjoy a redis instance that starts in no time > even when it's full of data :)
> Feedbacks are really really appreciated here. I want to know what you > think, what are your impressions on the design, tradeoffs, and so > forth, how it feels when you experiment with it. If you want to see > the inner workings set log level to "debug".
> The goal is to ship 2.4 ASAP with VM replaced with a good > implementation of diskstore.
>> Pardon my ignorance, but I am curious as to if this is similar to the >> way MongoDB works (it used memory-mapped files and fsyncs at regular >> intervals).
> It is not the same. MongoDB uses large files that hold both data and indexes > that are memory-mapped, and thus are limited to have databases at most the > architecture limit of your platform. That is, only up to about 3GB on a 32 > bit machine. The method described by Salvatore stores the value pointed to > by each base key in it's own file on disk, reading and (re-)writing them as > necessary.
> It is not the same. MongoDB uses large files that hold both data and indexes > that are memory-mapped, and thus are limited to have databases at most the > architecture limit of your platform. That is, only up to about 3GB on a 32 > bit machine. The method described by Salvatore stores the value pointed to > by each base key in it's own file on disk, reading and (re-)writing them as > necessary.
As an aside, if Redis stores the value pointed by each key in its own file, what are the implications of using filesystems which limit the total number of files?
On Wed, Jan 5, 2011 at 12:17 AM, Xiangrong Fang <xrf...@gmail.com> wrote: > This is really cool. Most dataset in practice has a % of items as "hot". > This approach makes redis scale UP well. For example, we plan to use a > Dell SSD box to replace regular hard disk in production, this will > definitely help throughput I think :-)
Great, please if you have the chance of doing any testing, drop us an email with what you find :)
> On question here "the contrary" means convert diskstore into rdb format? If > so, why we still need it after the adoption of diskstore? > What is the way to do BACKUP for diskstore ? Just copy / rsync the entire > directory tree I hope?
.rdb and AOF will still be the persistence model for in-memory operations. Even in diskstore mode .rdb is good as it will be a compact consistent point-in-time dump, and we need it anyway in order to support replication with diskstore. It's like if .rdb is the lingua franka among different Redis underlying implementations, so for replication we transfer the first bulk data as .rdb regardless of the back end used by a given instance.
Cheers, Salvatore
> -- > You received this message because you are subscribed to the Google Groups > "Redis DB" group. > To post to this group, send email to redis-db@googlegroups.com. > To unsubscribe from this group, send email to > redis-db+unsubscribe@googlegroups.com. > For more options, visit this group at > http://groups.google.com/group/redis-db?hl=en.
On Wed, Jan 5, 2011 at 12:59 AM, Baishampayan Ghose <b.gh...@gmail.com>wrote:
> > It is not the same. MongoDB uses large files that hold both data and > indexes > > that are memory-mapped, and thus are limited to have databases at most > the > > architecture limit of your platform. That is, only up to about 3GB on a > 32 > > bit machine. The method described by Salvatore stores the value pointed > to > > by each base key in it's own file on disk, reading and (re-)writing them > as > > necessary.
> As an aside, if Redis stores the value pointed by each key in its own > file, what are the implications of using filesystems which limit the > total number of files?
The implication is that you can only store so many keys on disk. You aren't using fat-12, are you? ;)
On Wed, Jan 5, 2011 at 9:59 AM, Baishampayan Ghose <b.gh...@gmail.com> wrote: > As an aside, if Redis stores the value pointed by each key in its own > file, what are the implications of using filesystems which limit the > total number of files?
Most filesystems have limit in the max number of files per directory, this is why we use 65536 directories, so there is no such a problem.
Mmap is faster and will provide use a single file database, but the problem is, it's very hard to avoid corruption with this schema. For instance in MongoDB if I'm correct in order to have a decent amount of durability you need to run a replica, otherwise something bad can happen.
This is since mmaped files will flush things on disk without any ordering, and at hardware page level. So it's very hard to guess what will be written and what not on disk in case of a crash.
Even if we switch to something different than our filesystem-based approach we need something more consistent than this. It's fine for mongo as it's the unique persistence offered and must be both fast and persistence, and mongodb guys tried to find some tradeoff. But in our case for speed we have the in memory back end, and when diskstore is enabled we have the object cache, so my feeling is that our on-disk KV implementation should be able to provide some more durability.
>> As an aside, if Redis stores the value pointed by each key in its own >> file, what are the implications of using filesystems which limit the >> total number of files?
> The implication is that you can only store so many keys on disk. You aren't > using fat-12, are you? ;)
On 2011-01-04, at 20:57 , Salvatore Sanfilippo wrote:
> So as a > reaction to the email sent by Tim about his unhappiness with VM I used > a few vacation days to start implementing a new model, that is was was > listed above as number "3".
> The new set of tradeoffs are very different. The result is called > diskstore
excellent news! also, "diskstore" sounds better than "durable vm" :) i will try to give it a spin later today.
>> As an aside, if Redis stores the value pointed by each key in its own >> file, what are the implications of using filesystems which limit the >> total number of files?
> Most filesystems have limit in the max number of files per directory, > this is why we use 65536 directories, so there is no such a problem.
> Mmap is faster and will provide use a single file database, but the > problem is, it's very hard to avoid corruption with this schema. For > instance in MongoDB if I'm correct in order to have a decent amount of > durability you need to run a replica, otherwise something bad can > happen.
> This is since mmaped files will flush things on disk without any > ordering, and at hardware page level. So it's very hard to guess what > will be written and what not on disk in case of a crash.
> Even if we switch to something different than our filesystem-based > approach we need something more consistent than this. It's fine for > mongo as it's the unique persistence offered and must be both fast and > persistence, and mongodb guys tried to find some tradeoff. But in our > case for speed we have the in memory back end, and when diskstore is > enabled we have the object cache, so my feeling is that our on-disk KV > implementation should be able to provide some more durability.
Thanks for the explanation, Salvatore. Redis rocks (we use it in production) and I am very optimistic about its future :)
just a thought, Salvatore, have you considered using berkeley db (sleepycat), or even mysql? if the API is so simple, it's worth a shot giving poeple a couple of options for the "back engine" or something. I will certainly play with it over the weekend and try to use my storage engine as an alternative, but I think it's worth also exploring BDB and even MYSQL (although I'm sure the network overhead will make it much slower). the reason I didn't use BDB back then was its licensing.
On Wed, Jan 5, 2011 at 12:15 AM, Salvatore Sanfilippo <anti...@gmail.com>wrote:
> On Tue, Jan 4, 2011 at 10:36 PM, Dvir Volk <dvir...@gmail.com> wrote: > > Salvatore, this sounds great. > > I told you on twitter about my b-tree implementation, and it's funny that > I > > wrote it after having started my project with almost exactly the same > > diskstore pattern (although it was 100 sub directories per directory, > which > > is a big difference).
> Dvir right this is very interesting, 100 or 256 or ... 1000?
> This was my reasoning. I want it to work well with 1 billion keys. So:
> 1000000000/(256*256) = 15258
> 15k files per dir are more or less near to the limit to get good > performances. > On the other side to create 65536 directory, or to scan 65536 > directory for BGSAVE, seems acceptable, and works pretty fast in > practice.
> > It didn't scale well with millions of objects, and made the engine become > > very slow over time. I assume the difference between 100^2 and 256^2 will > > make it scale better though. Also, for me most reads were from disk as > > caches were very small back then, so I'm guessing it's less of an issue > if > > most of the reads are from memory.
> An important variable here is the filesystem in use for sure. > Btw there is no reason to don't try with a B-TREE, given the interface > is pretty abstract. We should give it a try at least to get the > feeling about what other solutions can provide us.
> > Back then I eventually wrote a b-tree based storage engine that worked > much > > better with millions of varying size, growing blobs (inverted indexes > > mainly).
> Cool. If you want to hack a patch it can be an interesting experiment. > In the end we'll need something able to auto compact itself if we'll > select the B-TREE approach in the end.
> > Another question is - can diskstore (optionally of course) totally > replace > > RDBs? besides the fact that backups will be slower, why not allowing > users > > to use diskstore as the only store in redis?
> It's already this way. you can *optionally* call BGSAVE when diskstore > is active (or better you will ... still not implemented), but there is > no need for it. BGSAVE will also be called for replication if our > diskstore instance is a master, but our diskstore-BGSAVE should work > very well.
> I'll try implementing it tomorrow so'll know very soon :)
> Cheers, > Salvatore
> > On Tue, Jan 4, 2011 at 9:57 PM, Salvatore Sanfilippo <anti...@gmail.com> > > wrote:
> >> Hi all,
> >> a few months after VM started to work my feeling about it started to > >> be not very good. I stated in the blog, and privately on IRC, > >> especially talking with Pieter, that VM was not the way to go for the > >> future of Redis, and that the new path we were taking about using less > >> memory was a much better approach. Together with cluster.
> >> However there are a number of different models for dealing with > >> datasets bigger than RAM for Redis. Just to cite a few:
> >> 1) virtual memory, where we swap values on disk as needed (The Redis > >> Virtual Memory way) > >> 2) storing data on disk, in a complex form so that operations can be > >> implemented directly in the on-disk representation, and using the OS > >> cache as a cache layer for the working set (let's call it the Mongo DB > >> way) > >> 3) storing data on disk, but not for direct manipulation, and use > >> memory as a cache of objects that are active, flushing writes on disks > >> when this objects change.
> >> It is now clear that VM is not the right set of tradeoffs, it was > >> designed to be pretty fast but on the other hand there was a too big > >> price to pay for all the rest: slow restarts, slow saving, and in turn > >> slow replication, very complex code, and so forth.
> >> If you want pure speed with Redis, in memory is the way to go. So as a > >> reaction to the email sent by Tim about his unhappiness with VM I used > >> a few vacation days to start implementing a new model, that is was was > >> listed above as number "3".
> >> The new set of tradeoffs are very different. The result is called > >> diskstore, and this is how it works, in a few easy to digest points.
> >> - In diskstore key-value paris are stored on disk. > >> - Memory works as a cache for live objects. Operations are only > >> performed on in memory keys, so data on disk does not need to be > >> stored in complex forms. > >> - The cache-max-memory limit is strict. Redis will never use more RAM, > >> even if we have 2 MB of max memory and 1 billion of keys. This works > >> since now we don't need to take keys in memory. > >> - Data is flushed on disk asynchronously. If a key is marked as dirty, > >> and IO operation is scheduled for this key. > >> - You can control the delay between modifications of keys and disk > >> writes, so that if a key is modified a lot of time in small time, it > >> will written only one time on disk. > >> - Setting the delay to 0 means, sync it as fast as possible. > >> - All I/O is performed by a single dedicated thread, that is > >> long-running and not spawned on demand. The thread is awaked with a > >> conditional variable. > >> - The system is much simpler and sane than VM implementation, as there > >> is no need to "undo" operations on race conditions. > >> - Zero start-up time... as objects are loaded on demand. > >> - There is negative caching. If a key is not on disk we remember it > >> (if there is memory to do so). So we avoid accessing the disk again > >> and again for keys that are not there. > >> - The system is very fast if we access mostly our working set, and > >> this working set happens to fit memory. Otherwise the system is much > >> slower (I/O bound). > >> - The system does not support BGSAVE currently, but will support this, > >> and what is cool, with minimal overhead and used memory in the saving > >> child, as data on disk is already written using the same serialization > >> format as .rdb files. So our child will just copy files to obtain the > >> .rdb. In the mean time the objects in cache are not flushed, so the > >> system may use more memory, but it's not about copy-on-write, so it > >> will use very very little additional memory. > >> - Persistence is *PER KEY* this means, there is no point in time > >> persistence.
> >> I think that the above points may give you an idea about how it works. > >> But let me stress the per-key persistence point a bit.
> >> LPUSH a 0 > >> LPUSH b 1 > >> LPUHS a 2
> >> So after this commands we may have two IO scheduled operations > >> pending. One for "a" and one for "b". > >> Now imagine "a" is saved, and then the server goes down, Redis is > >> brutally killed, or alike. The database will contain a consistent > >> version of "a" and "b", but the version of "b" will be old, without > >> the "1" pushed.
> >> Also currently MULTI/EXEC is not transactional, but this will be > >> fixed, at least inside a multi/exec there will be guarantee that > >> either all values or nothing will be synched to disk (this will be > >> obtained using a journal file for transactions).
> >> Some more details. The system is composed of two layers:
> >> diskstore.c -- implements a trivial on disk key-value store > >> dscache.c -- implements the more complex caching layer
> >> diskstore.c is currently a filesystem based KV store. This can be > >> replaced with a B-TREE or something like that in the future if this > >> will be needed. However even if the current implementation has a big > >> overhead, it's pretty cool to have data as files, with very little > >> chances of loosing data and corruption (rename is used for writes). > >> But well if this does not scale well enough we'll drop it and replace > >> it with something better.
> >> The current implementations is similar to bigdis. 256 directories > >> containing 256 directories each are used, for a total of 65536 dir. > >> Every key is put inside the dirs addressed by SHA1(key) translated in > >> hex, for instance key "foo" is at:
> >> The cool thing is, diskstore.c exports a trivial interface to Redis, > >> so it's very simple to replace with something else without touching > >> too much internals.
> >> Stability: the system is obviously in alpha stage, however it works > >> pretty well, without obvious crashes. But warning, it will crash with > >> an assert if you try to BGSAVE.
> >> To try it download the "unstable" branch, edit redis.conf to enable > >> diskstore. Play with it. Enjoy a redis instance that starts in no time > >> even when it's full of data :)
> >> Feedbacks are really really appreciated here. I want to know what you > >> think, what are your impressions on the design, tradeoffs, and so > >> forth, how it feels when you experiment with it. If you want to see > >> the inner workings set log level to "debug".
> >> The goal is to ship 2.4 ASAP with VM replaced with a good > >> implementation of diskstore.
> >> "We are what we repeatedly do. Excellence, therefore, is not an act, > >> but a habit." -- Aristotele
> >> -- > >> You received this message because you are subscribed to the Google > Groups > >> "Redis DB" group. > >> To post to this group, send email to redis-db@googlegroups.com. > >> To
On Wed, Jan 5, 2011 at 11:36 AM, Dvir Volk <dvir...@gmail.com> wrote: > just a thought, Salvatore, > have you considered using berkeley db (sleepycat), or even mysql? > if the API is so simple, it's worth a shot giving poeple a couple of options > for the "back engine" or something. > I will certainly play with it over the weekend and try to use my storage > engine as an alternative, but I think it's worth also exploring BDB and even > MYSQL (although I'm sure the network overhead will make it much slower). > the reason I didn't use BDB back then was its licensing.
Supporting multiple back ends surely makes sense. On the other hand I want to have a clean build system with a clean default, so this is what we can do.
For default we'll have something like:
diskstore-plugin native
Or...
diskstore-plugin /path/to/your/plugin
This plugin will just be popen()-ed by Redis, so Redis will talk to this plugin via standard input/output, and will issue commands like GET / SET / and DEL, with a simple binary protocol. The plugin will reply writing to standard output, again with a simple binary protocol.
Given that anyway disk I/O is involved the overhead of the stdin/out chat is not a big issue, and this way people can create plugins for diskstore with any conceivable language and architecture. At the same time Redis will ship what a default that we consider sane for most users.
i think there are even more interesting options -- how about something clustered, like riak for example? this would take care of storing multiple copies, thus guarding against disk failures, and making failover much easier.
> just a thought, Salvatore, > have you considered using berkeley db (sleepycat), or even mysql? > if the API is so simple, it's worth a shot giving poeple a couple of options for the "back engine" or something. > I will certainly play with it over the weekend and try to use my storage engine as an alternative, but I think it's worth also exploring BDB and even MYSQL (although I'm sure the network overhead will make it much slower). > the reason I didn't use BDB back then was its licensing.
> On Wed, Jan 5, 2011 at 12:15 AM, Salvatore Sanfilippo <anti...@gmail.com> wrote: > On Tue, Jan 4, 2011 at 10:36 PM, Dvir Volk <dvir...@gmail.com> wrote: > > Salvatore, this sounds great. > > I told you on twitter about my b-tree implementation, and it's funny that I > > wrote it after having started my project with almost exactly the same > > diskstore pattern (although it was 100 sub directories per directory, which > > is a big difference).
> Dvir right this is very interesting, 100 or 256 or ... 1000?
> This was my reasoning. I want it to work well with 1 billion keys. So:
> 1000000000/(256*256) = 15258
> 15k files per dir are more or less near to the limit to get good performances. > On the other side to create 65536 directory, or to scan 65536 > directory for BGSAVE, seems acceptable, and works pretty fast in > practice.
> > It didn't scale well with millions of objects, and made the engine become > > very slow over time. I assume the difference between 100^2 and 256^2 will > > make it scale better though. Also, for me most reads were from disk as > > caches were very small back then, so I'm guessing it's less of an issue if > > most of the reads are from memory.
> An important variable here is the filesystem in use for sure. > Btw there is no reason to don't try with a B-TREE, given the interface > is pretty abstract. We should give it a try at least to get the > feeling about what other solutions can provide us.
> > Back then I eventually wrote a b-tree based storage engine that worked much > > better with millions of varying size, growing blobs (inverted indexes > > mainly).
> Cool. If you want to hack a patch it can be an interesting experiment. > In the end we'll need something able to auto compact itself if we'll > select the B-TREE approach in the end.
> > Another question is - can diskstore (optionally of course) totally replace > > RDBs? besides the fact that backups will be slower, why not allowing users > > to use diskstore as the only store in redis?
> It's already this way. you can *optionally* call BGSAVE when diskstore > is active (or better you will ... still not implemented), but there is > no need for it. BGSAVE will also be called for replication if our > diskstore instance is a master, but our diskstore-BGSAVE should work > very well.
> I'll try implementing it tomorrow so'll know very soon :)
> Cheers, > Salvatore
> > On Tue, Jan 4, 2011 at 9:57 PM, Salvatore Sanfilippo <anti...@gmail.com> > > wrote:
> >> Hi all,
> >> a few months after VM started to work my feeling about it started to > >> be not very good. I stated in the blog, and privately on IRC, > >> especially talking with Pieter, that VM was not the way to go for the > >> future of Redis, and that the new path we were taking about using less > >> memory was a much better approach. Together with cluster.
> >> However there are a number of different models for dealing with > >> datasets bigger than RAM for Redis. Just to cite a few:
> >> 1) virtual memory, where we swap values on disk as needed (The Redis > >> Virtual Memory way) > >> 2) storing data on disk, in a complex form so that operations can be > >> implemented directly in the on-disk representation, and using the OS > >> cache as a cache layer for the working set (let's call it the Mongo DB > >> way) > >> 3) storing data on disk, but not for direct manipulation, and use > >> memory as a cache of objects that are active, flushing writes on disks > >> when this objects change.
> >> It is now clear that VM is not the right set of tradeoffs, it was > >> designed to be pretty fast but on the other hand there was a too big > >> price to pay for all the rest: slow restarts, slow saving, and in turn > >> slow replication, very complex code, and so forth.
> >> If you want pure speed with Redis, in memory is the way to go. So as a > >> reaction to the email sent by Tim about his unhappiness with VM I used > >> a few vacation days to start implementing a new model, that is was was > >> listed above as number "3".
> >> The new set of tradeoffs are very different. The result is called > >> diskstore, and this is how it works, in a few easy to digest points.
> >> - In diskstore key-value paris are stored on disk. > >> - Memory works as a cache for live objects. Operations are only > >> performed on in memory keys, so data on disk does not need to be > >> stored in complex forms. > >> - The cache-max-memory limit is strict. Redis will never use more RAM, > >> even if we have 2 MB of max memory and 1 billion of keys. This works > >> since now we don't need to take keys in memory. > >> - Data is flushed on disk asynchronously. If a key is marked as dirty, > >> and IO operation is scheduled for this key. > >> - You can control the delay between modifications of keys and disk > >> writes, so that if a key is modified a lot of time in small time, it > >> will written only one time on disk. > >> - Setting the delay to 0 means, sync it as fast as possible. > >> - All I/O is performed by a single dedicated thread, that is > >> long-running and not spawned on demand. The thread is awaked with a > >> conditional variable. > >> - The system is much simpler and sane than VM implementation, as there > >> is no need to "undo" operations on race conditions. > >> - Zero start-up time... as objects are loaded on demand. > >> - There is negative caching. If a key is not on disk we remember it > >> (if there is memory to do so). So we avoid accessing the disk again > >> and again for keys that are not there. > >> - The system is very fast if we access mostly our working set, and > >> this working set happens to fit memory. Otherwise the system is much > >> slower (I/O bound). > >> - The system does not support BGSAVE currently, but will support this, > >> and what is cool, with minimal overhead and used memory in the saving > >> child, as data on disk is already written using the same serialization > >> format as .rdb files. So our child will just copy files to obtain the > >> .rdb. In the mean time the objects in cache are not flushed, so the > >> system may use more memory, but it's not about copy-on-write, so it > >> will use very very little additional memory. > >> - Persistence is *PER KEY* this means, there is no point in time > >> persistence.
> >> I think that the above points may give you an idea about how it works. > >> But let me stress the per-key persistence point a bit.
> >> LPUSH a 0 > >> LPUSH b 1 > >> LPUHS a 2
> >> So after this commands we may have two IO scheduled operations > >> pending. One for "a" and one for "b". > >> Now imagine "a" is saved, and then the server goes down, Redis is > >> brutally killed, or alike. The database will contain a consistent > >> version of "a" and "b", but the version of "b" will be old, without > >> the "1" pushed.
> >> Also currently MULTI/EXEC is not transactional, but this will be > >> fixed, at least inside a multi/exec there will be guarantee that > >> either all values or nothing will be synched to disk (this will be > >> obtained using a journal file for transactions).
> >> Some more details. The system is composed of two layers:
> >> diskstore.c -- implements a trivial on disk key-value store > >> dscache.c -- implements the more complex caching layer
> >> diskstore.c is currently a filesystem based KV store. This can be > >> replaced with a B-TREE or something like that in the future if this > >> will be needed. However even if the current implementation has a big > >> overhead, it's pretty cool to have data as files, with very little > >> chances of loosing data and corruption (rename is used for writes). > >> But well if this does not scale well enough we'll drop it and replace > >> it with something better.
> >> The current implementations is similar to bigdis. 256 directories > >> containing 256 directories each are used, for a total of 65536 dir. > >> Every key is put inside the dirs addressed by SHA1(key) translated in > >> hex, for instance key "foo" is at:
> >> The cool thing is, diskstore.c exports a trivial interface to Redis, > >> so it's very simple to replace with something else without touching > >> too much internals.
> >> Stability: the system is obviously in alpha stage, however it works > >> pretty well, without obvious crashes. But warning, it will crash with > >> an assert if you try to BGSAVE.
> >> To try it download the "unstable" branch, edit redis.conf to enable > >> diskstore. Play with it. Enjoy a redis instance that starts in no time > >> even when it's full of data :)
> >> Feedbacks are really really appreciated here. I want to know what you > >> think, what are your impressions on the design, tradeoffs, and so > >> forth, how it feels when you experiment with it. If you want to see > >> the inner workings set log level to "debug".
> >> The goal is to ship 2.4 ASAP with VM replaced with a good > >> implementation of diskstore.
On Wed, Jan 5, 2011 at 10:36 AM, Dvir Volk <dvir...@gmail.com> wrote: > just a thought, Salvatore, > have you considered using berkeley db (sleepycat), or even mysql?
Or the InnoDB plugin directly, given that it seems to have good performance: