Redis diskstore


Salvatore Sanfilippo

Jan 4, 2011, 2:57:53 PM
to Redis DB
Hi all,

a few months after VM started to work, my feelings about it were no
longer very good. I stated in the blog, and privately on IRC,
especially talking with Pieter, that VM was not the way to go for the
future of Redis, and that the new path we were taking about using less
memory was a much better approach, together with cluster.

However there are a number of different models for dealing with
datasets bigger than RAM for Redis. Just to cite a few:

1) virtual memory, where we swap values on disk as needed (The Redis
Virtual Memory way)
2) storing data on disk, in a complex form so that operations can be
implemented directly in the on-disk representation, and using the OS
cache as a cache layer for the working set (let's call it the Mongo DB
way)
3) storing data on disk, but not for direct manipulation, and using
memory as a cache of objects that are active, flushing writes to disk
when these objects change.

It is now clear that VM is not the right set of tradeoffs: it was
designed to be pretty fast, but on the other hand there was too big a
price to pay for everything else: slow restarts, slow saving, and in
turn slow replication, very complex code, and so forth.

If you want pure speed with Redis, in memory is the way to go. So as a
reaction to the email sent by Tim about his unhappiness with VM I used
a few vacation days to start implementing a new model, that is, what
was listed above as number "3".

The new set of tradeoffs is very different. The result is called
diskstore, and this is how it works, in a few easy-to-digest points.

- In diskstore key-value pairs are stored on disk.
- Memory works as a cache for live objects. Operations are only
performed on in-memory keys, so data on disk does not need to be
stored in complex forms.
- The cache-max-memory limit is strict. Redis will never use more RAM,
even with 2 MB of max memory and 1 billion keys. This works because
we no longer need to keep all keys in memory.
- Data is flushed to disk asynchronously. If a key is marked as dirty,
an IO operation is scheduled for that key.
- You can control the delay between modifications of a key and the
disk write, so that if a key is modified many times in a short
interval, it is written to disk only once (see the sketch after this
list).
- Setting the delay to 0 means: sync as fast as possible.
- All I/O is performed by a single dedicated thread, which is
long-running and not spawned on demand. The thread is awakened via a
condition variable.
- The system is much simpler and saner than the VM implementation, as
there is no need to "undo" operations on race conditions.
- Zero start-up time... as objects are loaded on demand.
- There is negative caching. If a key is not on disk we remember it
(if there is memory to do so). So we avoid accessing the disk again
and again for keys that are not there.
- The system is very fast if we access mostly our working set, and
this working set happens to fit in memory. Otherwise the system is
much slower (I/O bound).
- The system does not support BGSAVE currently, but it will, and, what
is cool, with minimal overhead and memory usage in the saving child,
as data on disk is already written using the same serialization
format as .rdb files. So the child will just copy files to obtain the
.rdb. In the meantime the objects in cache are not flushed, so the
system may use a bit more memory, but since there is no copy-on-write
involved, it will use very little additional memory.
- Persistence is *PER KEY*: this means there is no point-in-time persistence.
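
To make the write-delay behavior concrete, here is a minimal sketch,
in C, of the scheduling idea. All names (scheduleEntry, markKeyDirty)
and the linear-scan table are illustrative assumptions, not the
actual dscache.c code:

#include <string.h>
#include <time.h>

struct scheduleEntry {
    char *key;        /* dirty key waiting to be flushed */
    time_t flush_at;  /* time when the disk write should happen */
};

/* Called every time a command modifies 'key'. If the key is already
 * scheduled nothing is added, so many modifications within the delay
 * window result in a single disk write once flush_at is reached. */
void markKeyDirty(struct scheduleEntry *table, int *len,
                  char *key, int cache_flush_delay)
{
    int i;
    for (i = 0; i < *len; i++)
        if (strcmp(table[i].key, key) == 0) return;
    table[*len].key = key;
    table[*len].flush_at = time(NULL) + cache_flush_delay;
    (*len)++;
}

With cache-flush-delay set to 0 the entry is eligible immediately;
with 10, repeated writes within ten seconds collapse into one flush.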

I think that the above points may give you an idea about how it works.
But let me stress the per-key persistence point a bit.

LPUSH a 0
LPUSH b 1
LPUSH a 2

So after these commands we may have two scheduled IO operations
pending: one for "a" and one for "b".
Now imagine "a" is saved, and then the server goes down: Redis is
brutally killed, or something like that. The database will contain a
consistent version of "a" and "b", but the version of "b" will be
old, without the "1" pushed.

Also, currently MULTI/EXEC is not transactional, but this will be
fixed: at least inside a MULTI/EXEC there will be a guarantee that
either all values or none will be synched to disk (this will be
obtained using a journal file for transactions).
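
To illustrate how a journal could give that guarantee, here is a
rough sketch under stated assumptions: diskstoreSet() is a
hypothetical per-key write primitive, and the text format is invented
for the example. This is one possible approach, not the planned
implementation:

#include <stdio.h>
#include <unistd.h>

/* Hypothetical per-key write primitive of the on-disk KV store. */
void diskstoreSet(const char *key, const char *val);

/* Make the journal durable before touching any key, so that after a
 * crash either the journal is replayed (all values synched) or the
 * transaction never happened (nothing synched). */
int syncTransaction(const char *journal_path,
                    char **keys, char **vals, int n)
{
    int i;
    FILE *j = fopen(journal_path, "w");
    if (!j) return -1;
    for (i = 0; i < n; i++)
        fprintf(j, "%s\t%s\n", keys[i], vals[i]);
    fflush(j);
    if (fsync(fileno(j)) == -1) { fclose(j); return -1; }
    fclose(j);
    for (i = 0; i < n; i++)
        diskstoreSet(keys[i], vals[i]);
    unlink(journal_path);  /* commit point: transaction fully on disk */
    return 0;
}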

Some more details. The system is composed of two layers:

diskstore.c -- implements a trivial on disk key-value store
dscache.c -- implements the more complex caching layer

diskstore.c is currently a filesystem-based KV store. It can be
replaced with a B-TREE or something like that in the future, if
needed. However, even if the current implementation has a big
overhead, it's pretty cool to have data as files, with very little
chance of losing data or of corruption (rename is used for writes).
But if this does not scale well enough we'll drop it and replace
it with something better.

The current implementation is similar to bigdis. 256 directories
containing 256 directories each are used, for a total of 65536
directories. Every key is put inside the dir addressed by SHA1(key)
translated into hex; for instance, key "foo" is at:

/0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
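
As a small illustration, the mapping from digest to path could look
like the following C sketch; sha1hex stands for any routine producing
the 40-character hex digest of the key, and the real diskstore.c
differs in details:

#include <stdio.h>

/* Build the on-disk path of a key from its SHA1 hex digest: the first
 * two bytes of the digest select the two directory levels. */
void keyToPath(const char *sha1hex, char *path, size_t pathlen)
{
    snprintf(path, pathlen, "/%.2s/%.2s/%s",
             sha1hex, sha1hex + 2, sha1hex);
}

/* keyToPath(sha1_of_foo, buf, sizeof(buf)) yields the path
 * /0b/ee/0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33 shown above. */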

The cool thing is, diskstore.c exports a trivial interface to Redis,
so it's very simple to replace it with something else without touching
too many internals.

Stability: the system is obviously in alpha stage; however, it works
pretty well, without obvious crashes. But be warned: it will crash
with an assert if you try to BGSAVE.

To try it, download the "unstable" branch and edit redis.conf to
enable diskstore. Play with it. Enjoy a Redis instance that starts in
no time even when it's full of data :)

Feedback is really, really appreciated here. I want to know what you
think: your impressions of the design, the tradeoffs, and so forth,
and how it feels when you experiment with it. If you want to see the
inner workings, set the log level to "debug".

The goal is to ship 2.4 ASAP with VM replaced with a good
implementation of diskstore.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo
http://invece.org

"We are what we repeatedly do. Excellence, therefore, is not an act,
but a habit." -- Aristotle

Andy McCurdy

Jan 4, 2011, 3:49:52 PM
to redi...@googlegroups.com
Hey Salvatore,

I've definitely had issues with VM, and I'm excited to see how diskstore works out. In theory, the tradeoffs seem much better. A couple of questions:

What happens if I have a producer (or perhaps many producers) of data that add dirty data to Redis faster than the Diskstore IO thread can write it? Specifically, what if the dirty data approaches the max memory limits? Will the memory footprint grow temporarily until those values are persisted, then free itself?

What about if I have a large data structure? One with thousands of items in a list or zset. If this key is written to frequently (multiple times a second), will the IO thread get backed up by having to write such a large structure over and over again?

-andy




andrychris

Jan 4, 2011, 3:50:26 PM
to Redis DB
To share some quick, high level thoughts on my initial tests and what
I've seen so far...

I've been starting off with a basic test that essentially spams simple
SET requests to a redis instance with small (1-30 bytes or so) keys
and values as fast as it can go. I've been using a 2-cpu (for a total
of 16 cores) Xeon X5570 with 24GB of RAM and, at the moment, a solid
state drive that I'm using for long term storage.

* Once redis reaches the cache RAM limit, the machine starts thrashing
pretty badly, and ends up spending a good deal of time waiting for disk
i/o, generally slowing redis to a halt.  The SETs per second at
that point quickly drop off from about 25-26k to 10k, and then to
nothing as it thrashes back and forth.
* I haven't spent as much time testing this out, but increasing cache-
flush-delay seems to decrease the "hit" that occurs once we reach the
RAM limit, but that success is short lived :)

I will freely admit that my test is a little unfair at best, and
unrealistic at worst -- obviously in a production scenario I wouldn't
be using redis with a disk backing store simply to constantly issue
SET requests. However, I've been checking out various options to do
more or less exactly what the future redis v2.4 aims to do, and "what
happens when you run out of RAM" is a question that I need to answer,
and the answer, in most cases, seems to be "noooooo!!!!!!!" :). Not
entirely surprising given the stress I'm putting on it, but I need to
know what happens in worst case scenarios.

That being said, while having something that magically experienced
zero performance degradation when writing to disk instead of RAM would
be just lovely, I'm not gonna hold my breath :). I am looking for
consistent and predictable performance characteristics, however. At
the moment, redis effectively locks up if it's under heavy load, has
reached the cache size limit, and the db is continuing to grow.

I'll be continuing some of my testing in the near future to include
some scenarios that are a bit more realistic, mixing GET/SET/etc.
requests in certain ratios and so forth, using things that are at
least closer to "real world" data (for us, anyway).

Depending on how far I get, I might offer up a branch with a few
tweaks of my own, but that depends largely on how much time I can
devote to it in the near future :).

To close, I'll just say that for being as new and early stage as the
branch formerly known as 'diskstore' is, it's looking pretty good.
Thanks for all the hard work!

-Chris
> Salvatore 'antirez' Sanfilippo
> http://invece.org

Michael Russo

Jan 4, 2011, 4:24:04 PM
to redi...@googlegroups.com
A few questions:

- Is support for loading RDB files planned (when diskstore is enabled)?  If so, there will be a start-up delay as the RDB is re-written to the diskstore format, correct?

- Are there any ideas to change the process for starting new slaves, to address some of the existing shortcomings with replication, or do we think that the improved speed and lower memory requirements associated with RDB generation will negate the most serious concerns?

Overall, this looks very exciting, as it opens up even more use cases for Redis.

Thanks,
Michael

Salvatore Sanfilippo

Jan 4, 2011, 4:29:54 PM
to redi...@googlegroups.com
On Tue, Jan 4, 2011 at 9:49 PM, Andy McCurdy <sed...@gmail.com> wrote:
> Hey Salvatore,
> I've definitely had issues with VM, and am excited to see how diskstore
> works out. In theory, the tradeoffs seem much better. Couple questions.

Hello Andy, thanks for your feedback and your questions; they are
very interesting.

> What happens if I have a producer (or perhaps many producers) of data that
> add dirty data to Redis faster than the Diskstore IO thread can write it?
> Specifically, what if the dirty data approaches the max memory limits? Will
> the memory footprint grow temporarily until those values are persisted, then
> free itself?

Redis will start evicting keys that are not dirty from the cache, but
when it eventually runs completely out of memory compared to the max
memory limit, it will start blocking, waiting for disk I/O to complete
so that memory can be reclaimed. Currently this is not done
incrementally enough and will be improved, but there is no other
solution.

This basically means that Redis will survive peaks without problems,
but if the writes are sustainedly faster than IO, eventually it has
to wait for IO to complete :)

But I think it's a good idea that the new semantics is to use
cache-max-memory as a *hard* limit.
The only reason Redis will use more memory than that is to process a
command that involves values larger than the max memory setting
itself (for instance, you set maxmemory to 1 MB and then do an LPUSH
against a list that in memory is 2 MB, but this is not a problem I
think).

> What about if I have a large data structure? One with thousands of items in
> a list or zset. If this key is written to frequently (multiple times a
> second), will the IO thread get backed up by having to write such a large
> structure over and over again?

It is mainly up to you, as there is a setting called cache-flush-delay.
If you set it to, for instance, 10, then the first time you write
against the big zset it will get queued for disk IO, with the time for
completion set to now+10 seconds. The next writes will not add new
writes to the queue, so after 10 seconds the key will be transferred
to disk, and then, if there is a new operation against this key, it
will be queued again. The effect is that the key will be written to
disk every 10 seconds if there is a continuous stream of writes
against it. So it's a clear trade-off between durability and speed.

I want to point out that for things like leader boards and big sorted
sets, lists, and so on, the way to go is to use Redis with the default
in-memory back end. I guess that in many situations it will make a
lot of sense to run two Redis instances: one for bulk data, and one
for fast things like the leader board and other things that must be
very fast.

Cheers,
Salvatore

Dvir Volk

Jan 4, 2011, 4:36:32 PM
to redi...@googlegroups.com
Salvatore, this sounds great.
I told you on twitter about my b-tree implementation, and it's funny that I wrote it after having started my project with almost exactly the same diskstore pattern (although it was 100 sub-directories per directory, which is a big difference).
It didn't scale well with millions of objects, and the engine became very slow over time. I assume the difference between 100^2 and 256^2 will make it scale better, though. Also, for me most reads were from disk, as caches were very small back then, so I'm guessing it's less of an issue if most of the reads are from memory.
Back then I eventually wrote a b-tree based storage engine that worked much better with millions of varying size, growing blobs (inverted indexes mainly).

Another question is: can diskstore (optionally, of course) totally replace RDBs? Besides the fact that backups will be slower, why not allow users to use diskstore as the only store in Redis?

Salvatore Sanfilippo

Jan 4, 2011, 5:06:30 PM
to redi...@googlegroups.com
On Tue, Jan 4, 2011 at 9:50 PM, andrychris <ch...@noneofyo.biz> wrote:

Thank you for your comments Chris! Replying below.

> * Once redis reaches the cache RAM limit, the machine starts thrashing
> pretty bad, and ends up spending a good deal of time waiting for disk
> i/o and generally slowing redis to a halt.  The SETs per second at
> that point quickly drops of from about 25-26k to 10k, and then to
> nothing as it thrashes back and forth.

Yes, this is expected behavior, except for the fact that currently,
when we reach the condition where there is no memory to grow the
write queue, the whole current queue is processed in a blocking way.
Instead I need to make this incremental. It's very easy; the current
implementation was just a two-minute effort to make it work without
surpassing the memory limit. But of course, whatever we do, when
there are too many writes we'll start waiting for I/O.

> * I haven't spent as much time testing this out, but increasing cache-
> flush-delay seems to decrease the "hit" that occurs once we reach the
> RAM limit, but that success is short lived :)

Increasing cache-flush-delay and cache-max-memory helps in two different ways:

1) increasing cache-flush-delay helps a lot if you write again and again
2) increasing cache-max-memory helps to handle a peak that lasts for
more time.

But if the write-heavy workload continues for too long, in the end
performance will get lower and lower, until it matches the
performance of the I/O subsystem.

> I will freely admit that my test is a little unfair at best, and
> unrealistic at worst -- obviously in a production scenario I wouldn't
> be using redis with a disk backing store simply to constantly issue
> SET requests.  However, I've been checking out various options to do
> more or less exactly what the future redis v2.4 aims to do, and "what
> happens when you run out of RAM" is a question that I need to answer,
> and the answer, in most cases, seems to be "noooooo!!!!!!!" :).  Not
> entirely surprising given the stress I'm putting on it, but I need to
> know what happens in worst case scenarios.

Completely agree with your vision, the worst case scenario must be very clear.
And of course we can improve the I/O performance itself.

For instance, serializing / deserializing hashes / lists / sets is
slow, but now we have specially encoded data types that are...
guess what? A linear array of data! We can serialize them on disk
just by copying.

Many optimizations are possible.

> That being said, while having something that magically experienced
> zero performance degradation when writing to disk instead of RAM would
> be just lovely, I'm not gonna hold my breath :).  I am looking for
> consistent and predictable performance characteristics, however.  At
> the moment, redis effectively locks up if it's under heavy load, has
> reached the cache size limit, and the db is continuing to grow.

Yes, right now it consumes the whole queue of 200 writes, every
time... not exactly a good idea.
Will fix it soon.

> I'll be continuing some of my testing in the near future to include
> some scenarios that are a bit more realistic, mixing GET/SET/etc.
> requests in certain ratios and so forth, using things that are at
> least closer to "real world" data (for us, anyway).

Great

> Depending on how far I get, I might offer up a branch with a few
> tweaks of my own, but that depends largely on how much time I can
> devote to it in the near future :).

Thanks

> To close, I'll just say that for being as new and early stage as the
> branch formerly known as 'diskstore' is, it's looking pretty good.
> Thanks for all the hard work!

Thanks! Very useful feedback.

Cheers,
Salvatore


--
Salvatore 'antirez' Sanfilippo

Salvatore Sanfilippo

Jan 4, 2011, 5:08:32 PM
to redi...@googlegroups.com
On Tue, Jan 4, 2011 at 10:24 PM, Michael Russo <mjr...@gmail.com> wrote:
> A few questions:
> - Is support for loading RDB files planned (when diskstore is enabled)?  If
> so, there will be a start-up delay as the RDB is re-written to the diskstore
> format, correct?

The idea is to provide a tool to convert an .rdb file into diskstore format.
The contrary will be handled natively by Redis, just by calling BGSAVE.

> - Are there any ideas to change the process for starting new slaves, to
> address some of the existing shortcomings with replication, or do we think
> that the improved speed and lower memory requirements associated with RDB
> generation will negate the most serious concerns?

I think that BGSAVE will be a very efficient operation with diskstore,
so it will not create the problems that it created when VM was active.
We'll see how the actual implementation works, but I'm very positive
about this.

> Overall, this looks very exciting, as it opens up even more use cases for
> Redis.

Thanks!
Salvatore

Salvatore Sanfilippo

Jan 4, 2011, 5:15:17 PM
to redi...@googlegroups.com
On Tue, Jan 4, 2011 at 10:36 PM, Dvir Volk <dvi...@gmail.com> wrote:
> Salvatore, this sounds great.
> I told you on twitter about my b-tree implementation, and it's funny that I
> wrote it after having started my project with almost exactly the same
> diskstore pattern (although it was 100 sub directories per directory, which
> is a big difference).

Dvir, right, this is very interesting: 100 or 256 or... 1000?

This was my reasoning. I want it to work well with 1 billion keys. So:

1000000000/(256*256) = 15258

15k files per dir is more or less near the limit for good performance.
On the other side, creating 65536 directories, or scanning 65536
directories for BGSAVE, seems acceptable, and works pretty fast in
practice.

> It didn't scale well with millions of objects, and made the engine become
> very slow over time. I assume the difference between 100^2 and 256^2 will
> make it scale better though. Also, for me most reads were from disk as
> caches were very small back then, so I'm guessing it's less of an issue if
> most of the reads are from memory.

An important variable here is the filesystem in use, for sure.
Btw, there is no reason not to try a B-TREE, given that the interface
is pretty abstract. We should give it a try, at least to get a
feeling for what other solutions can provide us.

> Back then I eventually wrote a b-tree based storage engine that worked much
> better with millions of varying size, growing blobs (inverted indexes
> mainly).

Cool. If you want to hack a patch it can be an interesting experiment.
We'll need something able to auto-compact itself if we select the
B-TREE approach in the end.

> Another question is - can diskstore (optionally of course) totally replace
> RDBs? besides the fact that backups will be slower, why not allowing users
> to use diskstore as the only store in redis?

It's already this way: you can *optionally* call BGSAVE when diskstore
is active (or better, you will... it's still not implemented), but
there is no need for it. BGSAVE will also be used for replication if
our diskstore instance is a master, but our diskstore BGSAVE should
work very well.

I'll try implementing it tomorrow, so we'll know very soon :)

Cheers,
Salvatore

Xiangrong Fang

Jan 4, 2011, 6:17:33 PM
to redi...@googlegroups.com
This is really cool. Most datasets in practice have a percentage of items that are "hot". This approach makes Redis scale UP well. For example, we plan to use a Dell SSD box to replace a regular hard disk in production; this will definitely help throughput I think :-)

>> - Is support for loading RDB files planned (when diskstore is enabled)?
>> If so, there will be a start-up delay as the RDB is re-written to the
>> diskstore format, correct?
>
> The idea is to provide a tool to convert an .rdb file into diskstore format.
> The contrary will be handled natively by redis, just calling BGSAVE.

One question here: does "the contrary" mean converting diskstore into rdb format? If so, why do we still need it after the adoption of diskstore?

What is the way to do a BACKUP for diskstore? Just copy / rsync the entire directory tree, I hope?

Baishampayan Ghose

Jan 5, 2011, 1:43:09 AM
to redi...@googlegroups.com
> If you want pure speed with Redis, in memory is the way to go. So as a
> reaction to the email sent by Tim about his unhappiness with VM I used
> a few vacation days to start implementing a new model, that is, what was
> listed above as number "3".

Pardon my ignorance, but I am curious whether this is similar to the
way MongoDB works (it uses memory-mapped files and fsyncs at regular
intervals).

If not, what would be the disadvantages if Redis adopted such a scheme?

Regards,
BG

--
Baishampayan Ghose
b.ghose at gmail.com

Josiah Carlson

Jan 5, 2011, 2:57:36 AM
to redi...@googlegroups.com
On Tue, Jan 4, 2011 at 10:43 PM, Baishampayan Ghose <b.g...@gmail.com> wrote:
>> If you want pure speed with Redis, in memory is the way to go. So as a
>> reaction to the email sent by Tim about his unhappiness with VM I used
>> a few vacation days to start implementing a new model, that is, what was
>> listed above as number "3".

> Pardon my ignorance, but I am curious whether this is similar to the
> way MongoDB works (it uses memory-mapped files and fsyncs at regular
> intervals).

It is not the same. MongoDB uses large files, holding both data and indexes, that are memory-mapped, and is thus limited to databases at most the size of the architecture limit of your platform. That is, only up to about 3GB on a 32-bit machine. The method described by Salvatore stores the value pointed to by each base key in its own file on disk, reading and (re-)writing them as necessary.

> If not, what would be the disadvantages if Redis adopted such a scheme?

Given the current description of Redis diskstore, you could specify a 3 gig cache limit, but store an unlimited amount of data on disk. As long as your working set stayed below 3 gig, you would generally have very good performance. If a MongoDB-based scheme were used, you would be limited to 3GB of total data size, AND your on-disk representation would have to be the same as your in-memory representation (the on-disk representation of Redis structures tends to be smaller than their in-memory representation).

Regards,
 - Josiah

Josiah Carlson

Jan 5, 2011, 3:46:27 AM
to redi...@googlegroups.com
Salvatore,

Thank you for taking the time and effort to design and build this system. There is one concern that I have about this particular implementation, and a potential solution.

Depending on the load, some people will see a drastic reduction in IO performance when writing many different keys and files, even more so than the random IO alone would suggest. Why? Filesystem metadata. Every time a file is written to, its size, modification date, etc., all need to be updated. On a modern transactional filesystem (which we should all be using by now), this is done by first allocating the data extent, writing the data to disk, writing to a journal about how the metadata will change, then writing the actual metadata to the inode(s). This is necessary for all writes, and we have to deal with it.  However, as the number of files in an individual path increases, the efficiency of the inode that describes that path decreases, and the amount of time taken to update any particular file increases.  Probably not a big deal for someone with a million or so keys (only 16 files on average in each directory, with an expectation of some having up to 32).

However, what I'm taking from this design description is that the desire is for Redis to be usable in systems where main memory is significantly smaller than problem sizes, so looking at a hundred million keys is not unreasonable, which pushes this to 16k files in each path on average, peaking to 32k expected.

Here's an alternate design which is similar, but which should scale out better due to a 1) reduced number of files, 2) fixed number of inode entries, 3) potential higher write throughput by aggregating more data, and 4) a method that lends itself to round-robin flushing/rewriting (you'll see why this is important later):
* Data path has 256 pre-sharded hex paths (like the first-level of the described disk-store method).
* Instead of up to 256 sub-directories inside each of the 256 top-level directories, we have 256 AOF files per directory, which represent the data read/written for the 64k shards.
* Handle all writes/caching exactly the same as using the described diskstore, only now there are chunks of data that are read/written together when flushed.
* A 64k entry table can be used to keep track of the last time an AOF chunk was reprocessed, whether there is any dirty data, etc.
* Every X seconds, a small table can be flushed to disk to describe the cache state, last modified state, etc., of shards in the system. This would allow for Redis to pre-cache before requests start coming in.

Also, because the AOFs are pre-sharded 64k ways, re-writing each of them individually to be compact should be fast.

The major drawback to this system is that if you miss on a key, you have to read out the entire chunk, which could have large unrelated data contained therein. For example, I want the integer stored in key "foo", but there are a half-dozen 5-million-entry zsets in the same shard, and that shard isn't in cache... ouch. *

These are just some ideas. Feel free to discard them, consider them, whatever.

Regards,
 - Josiah

* This particular issue can be worked-around by having "automatic" and "user-defined" shard ids. If a key does not contain the "{...}" string, it is automatically sharded into the 64k shards via the 2 byte prefix of the sha1 hash. If it contains a "{....}" string, and the "...." is exactly 4 hex digits, then that is the shard-id (if it's not exactly 4 hex digits, then just that portion is the key to be passed into sha1). To handle the "but I don't want my integer value to collide with my giant zsets" issue, the "run a second Redis with a different data path" answer should suffice.
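
For what it's worth, a sketch of that shard selection in C, where
sha1_first2() stands in for a real SHA1 call returning the first two
digest bytes; the fallback of hashing only the tag contents when they
are not exactly 4 hex digits is omitted for brevity:

#include <string.h>
#include <ctype.h>

unsigned int sha1_first2(const char *key);  /* placeholder, 0..65535 */

/* Keys with a "{...}" tag of exactly 4 hex digits map to that shard
 * id directly; everything else is sharded automatically via SHA1. */
unsigned int shardForKey(const char *key)
{
    const char *open = strchr(key, '{');
    if (open) {
        const char *close = strchr(open, '}');
        if (close && close - open == 5) {   /* exactly 4 chars inside */
            unsigned int id = 0;
            const char *p;
            for (p = open + 1; p < close; p++) {
                if (!isxdigit((unsigned char)*p)) break;
                id = id * 16 + (isdigit((unsigned char)*p) ? *p - '0'
                     : tolower((unsigned char)*p) - 'a' + 10);
            }
            if (p == close) return id;      /* user-defined shard */
        }
    }
    return sha1_first2(key);                /* automatic shard */
}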


Baishampayan Ghose

Jan 5, 2011, 3:55:25 AM
to redi...@googlegroups.com
>> Pardon my ignorance, but I am curious as to if this is similar to the
>> way MongoDB works (it used memory-mapped files and fsyncs at regular
>> intervals).
>
> It is not the same. MongoDB uses large files that hold both data and indexes
> that are memory-mapped, and thus are limited to have databases at most the
> architecture limit of your platform. That is, only up to about 3GB on a 32
> bit machine. The method described by Salvatore stores the value pointed to
> by each base key in it's own file on disk, reading and (re-)writing them as
> necessary.

Thanks Josiah!

Baishampayan Ghose

Jan 5, 2011, 3:59:13 AM
to redi...@googlegroups.com
> It is not the same. MongoDB uses large files that hold both data and indexes
> that are memory-mapped, and thus are limited to have databases at most the
> architecture limit of your platform. That is, only up to about 3GB on a 32
> bit machine. The method described by Salvatore stores the value pointed to
> by each base key in it's own file on disk, reading and (re-)writing them as
> necessary.

As an aside, if Redis stores the value pointed to by each key in its
own file, what are the implications of using filesystems which limit
the total number of files?

Salvatore Sanfilippo

Jan 5, 2011, 4:06:22 AM
to redi...@googlegroups.com
On Wed, Jan 5, 2011 at 12:17 AM, Xiangrong Fang <xrf...@gmail.com> wrote:
> This is really cool.  Most dataset in practice has a % of items as "hot".
>  This approach makes redis scale UP well.  For example, we plan to use a
> Dell SSD box to replace regular hard disk in production, this will
> definitely help throughput I think :-)

Great! If you have the chance to do any testing, please drop us an
email with what you find :)

> On question here "the contrary" means convert diskstore into rdb format? If
> so, why we still need it after the adoption of diskstore?
> What is the way to do BACKUP for diskstore ? Just copy / rsync the entire
> directory tree I hope?

.rdb and AOF will still be the persistence model for in-memory
operation. Even in diskstore mode, .rdb is good as it will be a
compact, consistent, point-in-time dump, and we need it anyway in
order to support replication with diskstore. It's as if .rdb were the
lingua franca among the different Redis underlying implementations,
so for replication we transfer the first bulk of data as .rdb
regardless of the back end used by a given instance.

Cheers,
Salvatore

Josiah Carlson

Jan 5, 2011, 4:07:58 AM
to redi...@googlegroups.com
The implication is that you can only store so many keys on disk. You aren't using FAT12, are you? ;)

 - Josiah

Salvatore Sanfilippo

Jan 5, 2011, 4:10:20 AM
to redi...@googlegroups.com
On Wed, Jan 5, 2011 at 9:59 AM, Baishampayan Ghose <b.g...@gmail.com> wrote:

> As an aside, if Redis stores the value pointed by each key in its own
> file, what are the implications of using filesystems which limit the
> total number of files?

Most filesystems have a limit on the max number of files per
directory; this is why we use 65536 directories, so there is no such
problem.

Mmap is faster and would give us a single-file database, but the
problem is, it's very hard to avoid corruption with this scheme. For
instance in MongoDB, if I'm correct, in order to have a decent amount
of durability you need to run a replica; otherwise something bad can
happen.

This is because mmapped files flush things to disk without any
ordering, at hardware page level. So it's very hard to know what will
be written to disk and what will not in case of a crash.

Even if we switch to something different from our filesystem-based
approach, we need something more consistent than this. It's fine for
Mongo as it's the only persistence offered and must be both fast and
durable, and the MongoDB guys tried to find a tradeoff. But in our
case for speed we have the in-memory back end, and when diskstore is
enabled we have the object cache, so my feeling is that our on-disk
KV implementation should be able to provide some more durability.

Baishampayan Ghose

Jan 5, 2011, 4:10:54 AM
to redi...@googlegroups.com
>> As an aside, if Redis stores the value pointed by each key in its own
>> file, what are the implications of using filesystems which limit the
>> total number of files?
>
> The implication is that you can only store so many keys on disk. You aren't
> using fat-12, are you? ;)

Not at all :)

Tim Lossen

Jan 5, 2011, 4:28:55 AM
to redi...@googlegroups.com
On 2011-01-04, at 20:57 , Salvatore Sanfilippo wrote:

> So as a
> reaction to the email sent by Tim about his unhappiness with VM I used
> a few vacation days to start implementing a new model, that is, what was
> listed above as number "3".
>
> The new set of tradeoffs are very different. The result is called
> diskstore

excellent news! also, "diskstore" sounds better than "durable vm" :)
i will try to give it a spin later today.

tim

--
http://tim.lossen.de

Baishampayan Ghose

Jan 5, 2011, 4:32:10 AM
to redi...@googlegroups.com
>> As an aside, if Redis stores the value pointed by each key in its own
>> file, what are the implications of using filesystems which limit the
>> total number of files?
>
> Most filesystems have limit in the max number of files per directory,
> this is why we use 65536 directories, so there is no such a problem.
>
> Mmap is faster and will provide use a single file database, but the
> problem is, it's very hard to avoid corruption with this schema. For
> instance in MongoDB if I'm correct in order to have a decent amount of
> durability you need to run a replica, otherwise something bad can
> happen.
>
> This is since mmaped files will flush things on disk without any
> ordering, and at hardware page level. So it's very hard to guess what
> will be written and what not on disk in case of a crash.
>
> Even if we switch to something different than our filesystem-based
> approach we need something more consistent than this. It's fine for
> mongo as it's the unique persistence offered and must be both fast and
> persistence, and mongodb guys tried to find some tradeoff. But in our
> case for speed we have the in memory back end, and when diskstore is
> enabled we have the object cache, so my feeling is that our on-disk KV
> implementation should be able to provide some more durability.

Thanks for the explanation, Salvatore. Redis rocks (we use it in
production) and I am very optimistic about its future :)

Dvir Volk

Jan 5, 2011, 5:36:18 AM
to redi...@googlegroups.com
just a thought, Salvatore,
have you considered using berkeley db (sleepycat), or even mysql?
if the API is so simple, it's worth a shot giving people a couple of options for the "back engine" or something.
I will certainly play with it over the weekend and try to use my storage engine as an alternative, but I think it's worth also exploring BDB and even MySQL (although I'm sure the network overhead will make it much slower).
the reason I didn't use BDB back then was its licensing.

Salvatore Sanfilippo

Jan 5, 2011, 5:51:11 AM
to redi...@googlegroups.com
On Wed, Jan 5, 2011 at 11:36 AM, Dvir Volk <dvi...@gmail.com> wrote:
> just a thought, Salvatore,
> have you considered using berkeley db (sleepycat), or even mysql?
> if the API is so simple, it's worth a shot giving poeple a couple of options
> for the "back engine" or something.
> I will certainly play with it over the weekend and try to use my storage
> engine as an alternative, but I think it's worth also exploring BDB and even
> MYSQL (although I'm sure the network overhead will make it much slower).
> the reason I didn't use BDB back then was its licensing.

Supporting multiple back ends surely makes sense. On the other hand,
I want to have a clean build system with a clean default, so here is
what we can do.

By default we'll have something like:

diskstore-plugin native

Or...

diskstore-plugin /path/to/your/plugin

This plugin will just be popen()-ed by Redis, so Redis will talk to
this plugin via standard input/output, and will issue commands like
GET, SET, and DEL with a simple binary protocol. The plugin will
reply by writing to standard output, again with a simple binary
protocol.
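
As a rough sketch of what such a plugin could look like from the
plugin side (the framing below, a 1-byte opcode followed by 4-byte
native-endian lengths, is invented for the example; no actual
protocol has been specified yet):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

enum { OP_GET = 1, OP_SET = 2, OP_DEL = 3 };

/* Read a length-prefixed blob from stdin; NULL on EOF or error. */
static char *read_blob(uint32_t *len)
{
    char *buf;
    if (fread(len, 4, 1, stdin) != 1) return NULL;
    buf = malloc(*len);
    if (buf && fread(buf, 1, *len, stdin) != *len) { free(buf); return NULL; }
    return buf;
}

int main(void)
{
    uint8_t op;
    while (fread(&op, 1, 1, stdin) == 1) {
        uint32_t klen, vlen;
        char *key = read_blob(&klen), *val = NULL;
        if (!key) break;
        switch (op) {
        case OP_SET:
            if ((val = read_blob(&vlen)) == NULL) return 1;
            /* ... store key/val in the backend of your choice ... */
            break;
        case OP_GET:
            /* ... look up key, reply with 4-byte length + payload ... */
            break;
        case OP_DEL:
            /* ... remove key from the backend ... */
            break;
        }
        free(key);
        free(val);
        fflush(stdout);  /* push any reply through the pipe to Redis */
    }
    return 0;
}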

Given that disk I/O is involved anyway, the overhead of the
stdin/stdout chatter is not a big issue, and this way people can
create plugins for diskstore in any conceivable language and
architecture. At the same time Redis will ship with a default that we
consider sane for most users.

Makes sense?

Tim Lossen

Jan 5, 2011, 5:53:36 AM
to redi...@googlegroups.com
i think there are even more interesting options -- how about
something clustered, like riak for example? this would take care
of storing multiple copies, thus guarding against disk failures,
and making failover much easier.

--
http://tim.lossen.de

Pedro Melo

Jan 5, 2011, 6:14:14 AM
to redi...@googlegroups.com
Hi,

On Wed, Jan 5, 2011 at 10:36 AM, Dvir Volk <dvi...@gmail.com> wrote:
> just a thought, Salvatore,
> have you considered using berkeley db (sleepycat), or even mysql?

Or the InnoDB plugin directly, given that it seems to have good performance:

http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html

But given that the diskstore will be able to use different back-ends
via plugins, I assume that if someone really wants this, they will
appear.

Bye,
--
Pedro Melo
http://www.simplicidade.org/
xmpp:me...@simplicidade.org
mailto:me...@simplicidade.org

Dvir Volk

Jan 5, 2011, 6:41:52 AM
to redi...@googlegroups.com
I'm not sure what the performance loss of using a pipe would be; it would be nice to benchmark it.
My instinct leans towards having this loaded as a plugin written in C and compiled as a .so
so you can have the configuration file look like:

diskstore-plugin sleepycat.so

but this might make things too complex for the price of a small performance gain.
So the question is how much performance you lose by doing this externally via a pipe?


Salvatore Sanfilippo

Jan 5, 2011, 6:46:31 AM
to redi...@googlegroups.com
On Wed, Jan 5, 2011 at 12:41 PM, Dvir Volk <dvi...@gmail.com> wrote:

> So the question is how much performance are you losing by doing this
> externally via a pipe?

I think there is not even a need to measure it, since we receive
commands via a socket, so the Redis-plugin link will for sure be on
the same scale (much faster, actually). And probably most of these
plugins will end up talking to some other networking layer like MySQL
or Riak or others.

For sure a dynamic lib is more efficient, but, all things considered,
the flexibility of using the pipe is probably a bigger advantage.

Cheers,
Salvatore

Dvir Volk

Jan 5, 2011, 6:50:15 AM
to redi...@googlegroups.com
you could leave both options available :)
if we specify a .so plugin it will be a .so, otherwise a pipe
but I guess a pipe only will be fine 

Salvatore Sanfilippo

Jan 5, 2011, 6:55:45 AM
to redi...@googlegroups.com
On Wed, Jan 5, 2011 at 12:50 PM, Dvir Volk <dvi...@gmail.com> wrote:
> you could leave both options available :)
> if we specify a .so plugin it will be a .so, otherwise a pipe
> but I guess a pipe only will be fine

The reason I'm not so happy with the .so thing is that I never saw a
piece of software with a plugin system where a great deal of the
plugins were not in a sad state. And this, in the case of a database,
is a serious issue... Imagine these C-written plugins around on
GitHub, at different levels of completion... it does not sound too
good to me :)

With the pipe we can have much stabler plugins, isolated from Redis,
so that if the pipe is closed Redis will just report "Hey, your plugin
crashed, use something better." It's an important gain...

We'll always be in time to fix this later. If a plugin becomes
dominant, we can implement it as a built-in option, perhaps.

Santiago Perez

Jan 5, 2011, 8:42:17 AM
to redi...@googlegroups.com
Salvatore, this sounds amazing! This is exactly what we needed for Redis to cover all our storage needs! Really looking forward to seeing this become a reality, and eager to help in any way I can.

Is there a branch on github where we can get this to start playing with it?

BTW, I love the storage backend plugin idea. Is it correct to stipulate that a plugin could provide point-in-time persistence instead of per-key, or is this possibility off the table before reaching the plugin?

Excellent work!
Santiago


Fabien

Jan 5, 2011, 2:11:37 PM
to Redis DB
Hi,

We plan to use Redis in a FUSE filesystem which will act as an LRU
cache for a remote one (something like FSCACHE/CACHEFS for NFS).

More details about the use case here:
http://stackoverflow.com/questions/4557825/is-it-possible-to-have-a-linux-vfs-cache-with-a-fuse-filesystem

This new diskstore seems to be very exciting for us in this particular
use case.

First, we planned to:
- store metadata in an "all in memory" redis instance
- store cache pages in a standard local filesystem (as simple files)

But, because the remote filesystem doesn't fit in the local one, what
about the LRU cleaning of the cache pages in the local filesystem?

So, we are excited about this diskstore approach and we are thinking
about using it to store all the cache stuff (metadata and cache pages).

But I have three little questions:

- Is it a silly idea?
- I have noticed the new option "cache-max-memory", but, in this case,
will the "maxmemory" option limit the total (on-disk) dataset size?
(in other words, is it possible to use the "diskstore way" as an LRU
cache with volatile keys?)
- Does the new diskstore change Redis performance when storing
medium-sized values (like 100 KB)?

Regards,

Fabien

PS: we already use Redis as a "standard LRU cache" and as a "network
software bus" (lpush/brpop pattern) in a multiple producers/consumers
pattern and... it just rocks! Thanks!


Tim Lossen

Jan 6, 2011, 2:50:37 AM
to redi...@googlegroups.com
in fact, the longer i think about it -- wouldn't diskstore + sharding
give us most of the benefits of redis cluster, at a fraction of the
complexity?

i see two main scenarios:

a) multiple redis instances on the same machine, sharing the same
diskstore. this should work today, although i have not yet tested it.

b) multiple redis instances on different machines, sharing the same
(remote or clustered) diskstore backend -- NFS, cluster filesystem,
riak, cassandra etc. this should work as soon as salvatore has
implemented the pluggable backend part.

now that diskstore has removed "total size of dataset" as the
bottleneck, both a) and b) address "throughput" as the next likely
bottleneck. in addition, b) makes diskstore viable if the working
dataset is too large for a single machine.

if a) turns out to be useful (disk i/o might become the bottleneck
here), it could even be made into a diskstore configuration option,
telling redis to fork n processes on consecutive ports.

tim


--
http://tim.lossen.de

Josiah Carlson

Jan 6, 2011, 3:44:58 AM
to redi...@googlegroups.com
On Wed, Jan 5, 2011 at 11:50 PM, Tim Lossen <t...@lossen.de> wrote:
> in fact, the longer i think about it -- wouldn't diskstore + sharding give us most
> of the benefits of redis cluster, at a fraction of the complexity?

In my opinion, the only reason to use redis cluster instead of manually sharding redis, is the handling of failover with auto-replication of data.
 
> i see two main scenarios:
>
> a) multiple redis instances on the same machine, sharing the same diskstore.
> this should work today, although i have not yet tested it.

As long as you mean shards here, that would work fine as long as your client works correctly.

> b) multiple redis instances on different machines, sharing the same (remote
> or clustered) diskstore backend -- NFS, cluster filesystem, riak, cassandra etc.
> this should work as soon as salvatore has implemented the pluggable backend part.

Possible, yes. Scalable? Not any more so than Redis by itself. Riak, Cassandra, NFS, cluster filesystems, etc., aren't magic infrastructure that removes disk IO limits; they just distribute those limits out to more machines. It turns out it's very likely within a few percent of the exact same limits that Redis would have if it were just running on all of those machines anyway (though Redis would probably do a little better, thanks to its not needing to update an additional index structure).

Then again, S3 as a storage backend would be almost perfect (though some work-arounds would be necessary for the high-latency + high-concurrency writes that S3 can offer).

Regards,
 - Josiah


Xiangrong Fang

Jan 6, 2011, 3:47:02 AM
to redi...@googlegroups.com
2011/1/6 Josiah Carlson <josiah....@gmail.com>


> On Wed, Jan 5, 2011 at 11:50 PM, Tim Lossen <t...@lossen.de> wrote:
>> in fact, the longer i think about it -- wouldn't diskstore + sharding give us most
>> of the benefits of redis cluster, at a fraction of the complexity?
>
> In my opinion, the only reason to use redis cluster instead of manually sharding redis, is the handling of failover with auto-replication of data.

+1

Dvir Volk

Jan 6, 2011, 4:07:46 AM
to redi...@googlegroups.com
On Thu, Jan 6, 2011 at 10:44 AM, Josiah Carlson <josiah....@gmail.com> wrote:

> On Wed, Jan 5, 2011 at 11:50 PM, Tim Lossen <t...@lossen.de> wrote:
>> in fact, the longer i think about it -- wouldn't diskstore + sharding give us most
>> of the benefits of redis cluster, at a fraction of the complexity?
>
> In my opinion, the only reason to use redis cluster instead of manually sharding redis, is the handling of failover with auto-replication of data.

Having a shared diskstore provides auto-replication of data: every time a key gets hit on any shard, it's read from the diskstore.
This also provides automatic failure handling if you take a node out of the ring and change the number of nodes.
 

Tim Lossen

Jan 6, 2011, 5:52:46 AM
to redi...@googlegroups.com
hmmm .... using the 'unstable' branch from git, i am getting
segfaults almost immediately. i am trying to import a million
hashes, but redis crashes on the third one:

> ds_enabled:1
> role:master
> cache_max_memory:8589934592
> cache_blocked_clients:0
> db0:keys=2,expires=0
>
> [12120] 06 Jan 11:43:41 # src/redis-server(_redisPanic+0x7a) [0x42c19a]
> [12120] 06 Jan 11:43:41 # src/redis-server(_redisPanic+0x7a) [0x42c19a]
> [12120] 06 Jan 11:43:41 # src/redis-server(dsSet+0x184) [0x42f0a4]
> [12120] 06 Jan 11:43:41 # src/redis-server(IOThreadEntryPoint+0x1a6) [0x42a8c6]
> [12120] 06 Jan 11:43:41 # /lib/libpthread.so.0(+0x69ca) [0x7f1d7c2719ca]
> [12120] 06 Jan 11:43:41 # /lib/libc.so.6(clone+0x6d) [0x7f1d7bfce70d]
> Segmentation fault

am i doing something wrong here?


On 2011-01-05, at 10:28 , Tim Lossen wrote:

> i will try to give it a spin later today.

--
http://tim.lossen.de

Salvatore Sanfilippo

Jan 6, 2011, 6:44:53 AM
to redi...@googlegroups.com
Hello Tim, please can you post more output?

dsSet() currently panics when it is not able to write to disk (for
instance, disk full or the like).
But the exact error message is a few lines above.

Cheers,
Salvatore


--
Salvatore 'antirez' Sanfilippo
open source developer - VMware

Salvatore Sanfilippo

Jan 6, 2011, 6:48:18 AM
to redi...@googlegroups.com
On Wed, Jan 5, 2011 at 2:42 PM, Santiago Perez <san...@santip.com.ar> wrote:
> Salvatore, this sounds amazing! This exactly what we were needing for redis
> to cover all our storage needs! Really looking forward to seeing this become
> a reality, and eager to help in anyway I can.

Thanks Santiago,

> Is there a branch on github where we can get this to start playing with it?

yes, the "unstable" branch

> BTW, I love the storage backend plugin idea. Is it correct to stipulate that
> a plugin could provide point-in-time persistence instead of per key or is
> this possibility off the table prior to reaching the plugin?

No, the per-key consistency is up to the caching system that will feed
data to the disk KV store.
So the plugin can't change that.

But diskstore will support BGSAVE soon, so you can have point in time
persistence as snapshots.
And BGSAVE will be super fast with diskstore, nothing like BGSAVE + VM.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo


open source developer - VMware

http://invece.org

Salvatore Sanfilippo

Jan 6, 2011, 6:55:44 AM
to redi...@googlegroups.com
On Wed, Jan 5, 2011 at 8:11 PM, Fabien <fabien...@gmail.com> wrote:

> So, we are excited about this diskstore way and we are thinking about
> using it to store all the cache stuff (metadatas and cache pages).

In general diskstore will work well in two cases:

1) You have a lot of objects and a mostly-read workload. The
majority of the objects that you use very often fit in your memory.
You need speed.
2) You don't need so much speed and can live with the performance
that the I/O system is able to deliver (for instance, from 100 to
1000 ops per second); you have a very large dataset whose "working
set" does not fit in RAM at all.

Basically it's up to you to trade speed for memory used.

Instead, there is a case where diskstore, or any other on-disk
approach for Redis, will hardly work, and it is the following:

3) You have very large aggregate types, like lists, sets, or sorted
sets that are mostly > 10000 elements.

> But I have three little questions :
>
> - Is it a silly idea ?
> - I have noticed the new option "cache-max-memory" but, in this case,
> is the "maxmemory" option will limit the dataset (on disk) total
> size ? (in other words, is it possible to use the "diskstore way" as a
> LRU cache with volatile keys)

This is just the size of your cache: up to that size, objects will be
kept in memory. So it's exactly like a cache, but your dataset can be
any size.

> - Does this new diskstore change redis performances about storing
> medium sized values (like 100 KB)

A lot: in memory is much faster anyway, especially when your values are large.

Cheers,
Salvatore

>
> Regards,
>
> Fabien
>
> PS : we already use Redis as a "standard LRU cache" and as a "network
> software bus" (lpush/brpop pattern) in a multiple producers/consummers
> pattern and... it just rocks ! Thanks !
>
>



Salvatore Sanfilippo

Jan 6, 2011, 7:02:39 AM
to redi...@googlegroups.com
On Thu, Jan 6, 2011 at 8:50 AM, Tim Lossen <t...@lossen.de> wrote:
> in fact, the longer i think about it -- wouldn't diskstore + sharding give
> us most
> of the benefits of redis cluster, at a fraction of the complexity?

I don't think so, more later

> i see two main scenarios:
>
> a) multiple redis instances on the same machine, sharing the same diskstore.
> this should work today, although i have not yet tested it.

This works perfectly as long as you don't ask different instances for
the same keys, and you don't use BGSAVE. It's an interesting idea to
explore, but hardly a cluster.

> b) multiple redis instance on different machines, sharing the same (remote
> or clustered) diskstore backend -- NFS, cluster filesystem, riak, cassandra
> etc.
> this should work as soon as salvatore has implemented the pluggable backend
> part.

As with "a", IMHO.

The cluster is cool since every node is self-contained, can hold a
given percentage of keys, and you can change this dynamically while
the system is running; you can add and remove nodes while the system
is running, and there is no single point of failure. Moreover, it is
horizontally scalable, as there is no node-to-node communication at
all in order to execute queries.

So in order to have a distributed, horizontally scalable, dynamic
(add/remove nodes), and fault-tolerant system, we really need cluster.
Anyway, cluster is really useful for our base use case, that is, the
in-memory system.

While I don't agree with you that cluster is no longer necessary, I
think you have a point, in the sense that diskstore will *reduce* the
use case for Redis cluster for all the people that needed a cluster
not for speed but because a single box was not able to hold all their
data, even if they don't have high speed requirements, or have a
highly biased dataset. But these guys already had other alternatives
with other data stores.

Instead, those who really need Redis cluster, with tens of thousands
of queries per second on every node, against possibly large and
dynamic data structures like sorted sets, can be well served only by
redis-cluster itself :)

Thanks for the feedback.

Cheers,
Salvatore

--
Salvatore 'antirez' Sanfilippo


open source developer - VMware

http://invece.org

Tim Lossen

Jan 6, 2011, 9:39:12 AM
to redi...@googlegroups.com
> [18622] 06 Jan 15:24:09 # diskstore error opening /data/diskstore/a5/ca/0_a5ca123d2adbe61029e683ccc49ef36f418de1ec_1294323849_32675832: No such file or directory


ok, never mind, i understand what went wrong -- i deleted
the contents of /data/diskstore before starting redis.
in this case, the hex folders are *not* created. when i
remove the /data/diskstore directory itself, on the other
hand, everything works.

BTW, "cache-flush-delay" is in seconds, right?

--
http://tim.lossen.de

Santiago Perez

Jan 6, 2011, 9:40:28 AM
to redi...@googlegroups.com
> No, the per-key consistency is up to the caching system that will feed
> data to the disk KV store.
> So the plugin can't change that.
>
> But diskstore will support BGSAVE soon, so you can have point in time
> persistence as snapshots.
> And BGSAVE will be super fast with diskstore, nothing like BGSAVE + VM.

Any chance of supporting AOF alongside diskstore? I know it would cut write performance in half, but it would give much more up-to-date point-in-time persistence, while still allowing faster controlled restarts and recovery of almost the entire dataset (point in time) in case of a crash.

Is it correct to assume that if redis is properly stopped with the QUIT command the end result on the diskstore will be a point in time view?

Nate Lawson

unread,
Jan 6, 2011, 9:43:39 AM1/6/11
to Redis DB
On Jan 5, 2:36 am, Dvir Volk <dvir...@gmail.com> wrote:
> just a thought, Salvatore,
> have you considered using berkeley db (sleepycat), or even mysql?
> if the API is so simple, it's worth a shot giving people a couple of options
> for the "back engine" or something.
> I will certainly play with it over the weekend and try to use my storage
> engine as an alternative, but I think it's worth also exploring BDB and even
> MYSQL (although I'm sure the network overhead will make it much slower).
> the reason I didn't use BDB back then was its licensing.

This is a funny discussion since I hacked out a rough prototype of
this the other day. I agree there are many reasons to use a disk-
backed store larger than RAM. I like Redis' data model and think it
will be better to support multiple backends.

Here is a diff against redis-2.0.4 that adds Berkeley DB 4.2 support
for SET/GET. It is a rough prototype for seeing what internal API
changes are needed to support a pluggable backend. It definitely has
memory leaks since it has to work with robj values and does not do
that right yet. It has only been tested on FreeBSD.

http://www.root.org/~nate/freebsd/redis-berkdb.diff

The way it works: it creates 16 Berkeley DBs via the dict interface,
then does GETs/SETs against them. It does not yet support serialization
of zipmaps, etc., so other data types are disabled.
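
Roughly, the setup boils down to something like this (a simplified
sketch, not the exact code from the diff; openBackends and the file
names are invented):

#include <db.h>
#include <stdio.h>
#include <stdlib.h>

/* open one Berkeley DB per Redis logical database */
static DB *backends[16];

static void openBackends(void) {
    char fname[64];
    int j;
    for (j = 0; j < 16; j++) {
        snprintf(fname, sizeof(fname), "redis-db-%d.bdb", j);
        if (db_create(&backends[j], NULL, 0) != 0) exit(1);
        if (backends[j]->open(backends[j], NULL, fname, NULL,
                              DB_BTREE, DB_CREATE, 0644) != 0) exit(1);
    }
}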

As you can see, I had to expose some of the robj internals to be able
to access the data, plus the createStringObject() fn. It would be very
nice if there were an API for dict to work with robj values instead of
just void* being passed as key/value. Also, having length parameters
for key/value would help, possibly avoiding memcpy()
overhead. We also need dictCreate to take the dict type/number; I faked
that with the dbId static.
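
Something along these lines would remove most of the friction
(hypothetical signatures, invented for illustration -- not the real
dict.h):

#include <stddef.h>

/* a (ptr,len) pair instead of a bare void*, so binary-safe values
   work and no strlen() or extra memcpy() is needed */
typedef struct dictBuf {
    void *data;
    size_t len;
} dictBuf;

typedef struct dict dict; /* opaque */

int dictAddBuf(dict *d, dictBuf key, dictBuf val);
int dictFetchBuf(dict *d, dictBuf key, dictBuf *val);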

The current expires mechanism does not work well with a disk-based
backend because it depends on fast lookup of keys. Also, the pubsub
stuff doesn't fit well. I commented them out in this version since
they are a harder problem to solve here.

I actually don't like the proposal of adding another protocol between
Redis and backend plugins. Just because the calls eventually hit disk
doesn't mean we need more overhead in-between. For example, the per-
command latency would go up due to this (de)serialization of data, making
access to large values slower. Since Redis is async, this will slow
down other transactions as well.

I don't see any reason why the plugin interface can't be C calls via a
set of function pointers. The protocol overhead should be reserved for
the public API that clients use and not added to backends.
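
Concretely, something this small would do (a sketch of the idea, with
invented names -- not a proposed final API):

#include <stddef.h>

/* each backend (.so or built-in) fills in one of these and Redis
   calls straight through the pointers -- no marshaling in between */
typedef struct redisBackend {
    int (*open)(const char *path);
    int (*set)(const void *key, size_t klen,
               const void *val, size_t vlen);
    int (*get)(const void *key, size_t klen,
               void **val, size_t *vlen);
    int (*del)(const void *key, size_t klen);
    void (*close)(void);
} redisBackend;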

Let me know if there are questions. I hope this patch will help rework
Redis internals to support disk backends and storage larger than RAM.

Thanks,
Nate

Josiah Carlson

unread,
Jan 6, 2011, 11:40:05 AM1/6/11
to redi...@googlegroups.com
To rephrase what I said, to cause less confusion (because I don't see how your reply would follow mine):
The reason to switch from a standard sharded redis diskstore (which seems poised to be released soon) to redis cluster (which is not yet out) is that redis cluster will have failover for when a master shard dies, will rebalance, etc.

Given the diskstore version of redis that Salvatore has described, it won't work as you describe until clustering is integrated, and even then it won't. The race conditions around reading/writing a single path on disk (or a remote file system) make a single shared disk for multiple nodes in a cluster a non-starter (think corruption, stale values, segfaults, etc.).

Regards,
 - Josiah

Salvatore Sanfilippo

unread,
Jan 6, 2011, 12:43:42 PM1/6/11
to redi...@googlegroups.com
On Thu, Jan 6, 2011 at 3:39 PM, Tim Lossen <t...@lossen.de> wrote:
>> [18622] 06 Jan 15:24:09 # diskstore error opening /data/diskstore/a5/ca/0_a5ca123d2adbe61029e683ccc49ef36f418de1ec_1294323849_32675832: No such file or directory
>
>
> ok, never mind, i understand what went wrong -- i deleted
> the contents of /data/diskstore before starting redis.
> in this case, the hex folders are *not* created. when i
> remove /data/diskstore, on the other hand, everything works.

Exactly, Tim: it makes sense to check that at least 00/00 and
ff/ff exist before starting, even if the main directory is found. This
way we can make sure that the previous launch of Redis was able to
create all the dirs (given that they are created in this order);
otherwise we can exit with a more decent error than a redisPanic()...
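
Something like this would be enough (just a sketch of the idea, not
the actual diskstore code; the helper names are made up):

#include <stdio.h>
#include <sys/stat.h>

static int dirExists(const char *path) {
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* dirs are created in order from 00/00 to ff/ff, so if both exist
   the previous run completed the whole hierarchy */
static int diskstoreLayoutComplete(const char *root) {
    char buf[1024];
    snprintf(buf, sizeof(buf), "%s/00/00", root);
    if (!dirExists(buf)) return 0;
    snprintf(buf, sizeof(buf), "%s/ff/ff", root);
    return dirExists(buf);
}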

> BTW, "cache-flush-delay" is in seconds, right?

Yes, when it's set to 0 it means: flush ASAP.
But ASAP does not mean that the flush is performed in a synchronous
way. There is a plan to also add a sync flush. It's slow, but a few
users may want it.
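
For reference, the relevant knobs look more or less like this in
redis.conf (a sketch based on this thread; option names may differ in
the actual diskstore branch):

diskstore-enabled yes
diskstore-path /data/diskstore
cache-max-memory 500mb
# seconds between a key becoming dirty and its flush to disk; 0 = ASAP
cache-flush-delay 5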

Cheers,
Salvatore

Nate Lawson

unread,
Jan 6, 2011, 2:09:30 PM1/6/11
to Redis DB
On Jan 6, 6:43 am, Nate Lawson <n...@root.org> wrote:
> On Jan 5, 2:36 am, Dvir Volk <dvir...@gmail.com> wrote:
>
> > just a thought, Salvatore,
> > have you considered using berkeley db (sleepycat), or even mysql?
> > if the API is so simple, it's worth a shot giving people a couple of options
> > for the "back engine" or something.
> > I will certainly play with it over the weekend and try to use my storage
> > engine as an alternative, but I think it's worth also exploring BDB and even
> > MYSQL (although I'm sure the network overhead will make it much slower).
> > the reason I didn't use BDB back then was its licensing.
>
> This is a funny discussion since I hacked out a rough prototype of
> this the other day. I agree there are many reasons to use a disk-
> backed store larger than RAM. I like Redis' data model and think it
> will be better to support multiple backends.
>
> Here is a diff against redis-2.0.4 that adds Berkeley DB 4.2 support
> for SET/GET. It is a rough prototype for seeing what internal API
> changes are needed to support a pluggable backend. It definitely has
> memory leaks since it has to work with robj values and does not do
> that right yet. It has only been tested on FreeBSD.
>
> http://www.root.org/~nate/freebsd/redis-berkdb.diff

It would also be great to see a Redis MySQL plugin, similar to this
one, which speaks the memcached protocol:

http://yoshinorimatsunobu.blogspot.com/2010/10/using-mysql-as-nosql-story-for.html

Nate Lawson

unread,
Jan 6, 2011, 2:13:21 PM1/6/11
to Redis DB
On Jan 5, 3:50 am, Dvir Volk <dvir...@gmail.com> wrote:
> you could leave both options available :)
> if we specify a .so plugin it will be a .so, otherwise a pipe
> but I guess a pipe only will be fine

I'd like to see the .so be the plugin interface (or even .o to prevent
versioning problems of external .so libs). However, Redis could ship
with a socket-plugin backend as the default that does what Salvatore
suggests.

Either way, the issues with the builtin dict I mentioned in the
previous email will need to be addressed.

Thanks,
Nate

Nate Lawson

unread,
Jan 6, 2011, 2:15:44 PM1/6/11
to Redis DB
On Jan 5, 3:55 am, Salvatore Sanfilippo <anti...@gmail.com> wrote:
> On Wed, Jan 5, 2011 at 12:50 PM, Dvir Volk <dvir...@gmail.com> wrote:
> > you could leave both options available :)
> > if we specify a .so plugin it will be a .so, otherwise a pipe
> > but I guess a pipe only will be fine
>
> The reason I'm not so happy with the .so thing is that I never saw a
> software with a plugin system where a great deal of the plugins were
> not in a sad state. And this in the case of a database is a serious
> issue... Imagine these C-written plugins around on github, at different
> levels of completion... does not sound too good to me :)
>
> With the pipe we can have much stabler plugins, isolated from Redis, so
> that if the pipe is closed Redis will just report "Hey, your plugin
> crashed, use something better." It's an important gain...
>
> We'll always be in time to fix this later. If a plugin becomes
> dominant, we can implement it as a built-in option, perhaps.

I think that's more of a project management problem, not a fundamental
problem of using a shared obj file. If stable versions of Berkeley DB,
InnoDB, etc. plugins ship along with Redis itself, there won't be
version drift.

If the plugin interface is going to change too often for a .so, it will
be changing for a socket protocol also. Again, this is a problem with
how a project is run, not an inherent problem in how the data is
marshaled between components.

-Nate

Jak Sprats

unread,
Jan 6, 2011, 7:25:53 PM1/6/11
to Redis DB
Nate,

+ db_key.data = k->ptr;
+ db_key.size = strlen(k->ptr);

be careful, robj's can be sds'es and longs, so check robj->encoding.
additionally, if a robj is an sds, use sdslen() to avoid a pass over
the string
(obviously not the case if these ptrs are YOUR robj's, they may be
strings every time ... bit confused)
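
something like this, sketched (assumes the 2.x robj/sds internals from
redis.h; the helper name is made up):

#include <stdio.h>
#include "redis.h" /* robj, sds, sdslen, REDIS_ENCODING_* */

/* get (data,len) out of a robj whatever its encoding */
static void keyBytes(robj *k, void **data, size_t *len,
                     char *numbuf, size_t numbuflen) {
    if (k->encoding == REDIS_ENCODING_RAW) {
        *data = k->ptr;
        *len = sdslen(k->ptr);   /* O(1), no strlen() pass */
    } else {
        /* REDIS_ENCODING_INT: ptr holds a long, not a buffer */
        *len = snprintf(numbuf, numbuflen, "%ld", (long)k->ptr);
        *data = numbuf;
    }
}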

There is a trick if your particular layer of storage does not need the
meta-data stored in a robj: don't store robj's, just ptr's.
1.) in dbp->put() store only YOUR ptr
2.) abstract the creation of a robj into dbp->get()
This makes the robj virtual (i.e. it is not stored, it is created when
needed, meaning you negate all robj state, but you can save loads of
bytes this way)
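
in code the get side looks roughly like this (rawGet() is a made-up
name for whatever your storage layer provides):

extern int rawGet(const void *key, size_t klen,
                  void **data, size_t *len); /* your storage layer */

/* the robj is "virtual": it never hits the disk, it is rebuilt on
   demand from the raw bytes */
robj *backendGet(const void *key, size_t klen) {
    void *data; size_t len;
    if (rawGet(key, klen, &data, &len) != 0) return NULL; /* miss */
    /* all robj meta-data is recreated here, none of it is stored */
    return createStringObject((char*)data, len);
}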

if you want to discuss this more, we can do so on a separate thread,
'cause this is off-topic for this one

- Jak

p.s. happy that someone is hacking the C :)

Jak Sprats

unread,
Jan 6, 2011, 7:35:57 PM1/6/11
to Redis DB
>> An important variable here is the filesystem in use for sure.

Is there a recommended filesystem for diskstore (btrfs, ext4, etc...),
and conversely are there filesystems that are bound to have
performance issues w/ diskstore?

best to get this down in writing before people start coming back w/
"diskstore is lame on my fat-12 filesystem" :)

I think VM got negative press (way back) because of OSX filesystem
issues.


Tim Lossen

unread,
Jan 7, 2011, 4:32:16 AM1/7/11
to redi...@googlegroups.com
On 2011-01-06, at 13:02 , Salvatore Sanfilippo wrote:
> On Thu, Jan 6, 2011 at 8:50 AM, Tim Lossen <t...@lossen.de> wrote:
>> in fact, the longer i think about it -- wouldn't diskstore + sharding give
>> us most
>> of the benefits of redis cluster, at a fraction of the complexity?
>
> I don't think so, [...]

> in order to have a distributed, horizontally scalable, dynamic
> (add/remove nodes), and fault tolerant system, we really need cluster.


salvatore, let me explain in more detail what i mean.

what i have in mind is a two-tiered storage system, where the
frontend tier consists of M sharded redis instances, and the
backend tier consists of N clustered storage nodes, riak for
example.

there is a clear separation of concerns between the two
tiers: the frontend is responsible for data structures,
performance, throughput (where redis really shines), and the
backend is responsible for capacity, persistence and
durability (where other systems are a better fit).

this setup can easily be scaled horizontally, even in two
dimensions: if we run out of redis cpu or redis memory, we
increase M; if we run out of disk io or disk space, we
increase N.

to change the number of redis instances, we shut them down
(flushing all dirty data to the backend storage), reconfigure
the clients, and start them up again. this means a short
downtime.
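
a client could pick its frontend instance with a simple keyed hash --
minimal sketch, djb2 just as an example, any stable hash works:

/* map a key to one of the M frontend redis instances */
unsigned int shardForKey(const char *key, size_t keylen, unsigned int M) {
    unsigned int h = 5381;
    while (keylen--) h = (h * 33) ^ (unsigned char)*key++;
    return h % M;
}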

failover works basically the same -- if a redis node goes
down, we simply fire up another one to take over its shard.
of course, we lose a little bit of unflushed data in this
case.

if a riak node goes down, on the other hand, this is
completely transparent to the application and we should not
lose any data, as it will be replicated (possibly even more
than once) inside the backend tier.

> While I don't agree with you that cluster is no longer necessary, I
> think you have a point, in the sense that diskstore will *reduce* the
> use case of Redis cluster

exactly, that is the main point i wanted to make. although
redis cluster may be superior, the above setup will be "good
enough" for a lot of use cases. and by outsourcing all the
cluster complexity to somebody else, we can get it for free
(and pretty soon).

personally, i think the clustering problem has already been
solved well enough by others, and that redis could add more
value by concentrating on its own unique strengths instead.

cheers
tim

--
http://tim.lossen.de

Nate Lawson

unread,
Jan 9, 2011, 3:01:04 PM1/9/11
to Redis DB
On Jan 6, 4:25 pm, Jak Sprats <jakspr...@gmail.com> wrote:
> Nate,
>
> +    db_key.data = k->ptr;
> +    db_key.size = strlen(k->ptr);
>
> be careful, robj's can be sds'es and longs, so check robj->encoding.
> additionally, if a robj is an sds, use sdslen() to avoid a pass over
> the string
> (obviously not the case if these ptrs are YOUR robj's, they may be
> strings every time ... bit confused)
>
> There is a trick if your particular layer of storage does not need the
> meta-data stored in a robj: don't store robj's, just ptr's.
> 1.) in dbp->put() store only YOUR ptr
> 2.) abstract the creation of a robj into dbp->get()
> This makes the robj virtual (i.e. it is not stored, it is created when
> needed, meaning you negate all robj state, but you can save loads of
> bytes this way)
>
> if you want to discuss this more, we can do so on a separate thread,
> 'cause this is off-topic for this one

These are all good points. As I said, this is only a prototype to see
which parts of Redis were not friendly to adding a new storage backend.
I hope Salvatore finds it useful as he works on diskstore.

I don't plan to continue it further. For our production use, I've gone
with a different database, keeping Redis in the places where it is
currently best: data structures for workflow and a fast memory-only
cache.

Thanks,
Nate